Posted to user@cassandra.apache.org by Ben Frank <be...@airlust.com> on 2012/08/29 21:32:02 UTC
Cassandra Hadoop integration issue using CFIF
Hey all,
I'm having an issue using ColumnFamilyInputFormat in a Hadoop job. The
mappers spin out of control and just keep reading records over and over,
never getting to the end. I have a CF with wide rows (although none is past
about 5 at the columns at the moment), and I've tried setting wide rows to
both true and false. If I turn on debugging, I see what look like strange
input splits being created (see the -1):
hadoop.ColumnFamilyInputFormat: partitioner is org.apache.cassandra.dht.RandomPartitioner@203727c5
hadoop.ColumnFamilyInputFormat: adding ColumnFamilySplit((127605887595351923798765477786913079296, '-1] @[cass1, cass2, cass3])
hadoop.ColumnFamilyInputFormat: adding ColumnFamilySplit((-1, '0] @[cass1, cass2, cass3])
hadoop.ColumnFamilyInputFormat: adding ColumnFamilySplit((0, '42535295865117307932921825928971026432] @[cass2, cass3, cass4])
hadoop.ColumnFamilyInputFormat: adding ColumnFamilySplit((42535295865117307932921825928971026432, '85070591730234615865843651857942052864] @[cass3, cass4, cass1])
hadoop.ColumnFamilyInputFormat: adding ColumnFamilySplit((85070591730234615865843651857942052864, '127605887595351923798765477786913079296] @[cass4, cass1, cass2])
If I debug in Eclipse (with widerows=false), I see that this call in
ColumnFamilyRecordReader.StaticRowIterator.maybeInit() is setting
startToken to -1:
startToken = partitioner.getTokenFactory().toString(partitioner
.getToken(Iterables.getLast(rows).key));
I'm using Cassandra 1.1.2 with a 4-node cluster, a replication factor of 3,
and Hadoop 0.20.1. Here's the output of nodetool ring:
Address        DC           Rack   Status  State   Load      Effective-Ownership  Token
                                                                                  127605887595351923798765477786913079296
129.19.63.126  datacenter1  rack1  Up      Normal  46.91 GB  75.00%               0
129.19.63.127  datacenter1  rack1  Up      Normal  49.45 GB  75.00%               42535295865117307932921825928971026432
129.19.63.128  datacenter1  rack1  Up      Normal  43.19 GB  75.00%               85070591730234615865843651857942052864
129.19.63.129  datacenter1  rack1  Up      Normal  46.9 GB   75.00%               127605887595351923798765477786913079296
Does anyone have any idea what's going on here? I'm assuming the splits are
wrong, so I'm going to focus on seeing what's up with that. Is there anything
else I should look at?
-Ben
Re: Cassandra Hadoop integration issue using CFIF
Posted by Ben Frank <be...@airlust.com>.
This line always returns "0" because the key ByteBuffer has already been
read from.
startToken = partitioner.getTokenFactory().toString(partitioner.getToken(Iterables.getLast(rows).key));
I was able to get it to work by using .mark() and .reset() on the buffer.
I'll log a bug, but I'm confused as to why no one else is running into this.
-Ben
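
For anyone following the thread, here is a minimal, self-contained sketch of the ByteBuffer behaviour described above. It is not Cassandra code: tokenFor() and readKey() are hypothetical helpers made up for illustration, and the choice to return -1 for a buffer with no remaining bytes is an assumption picked to mirror the value seen in the debug output. The point is only that a relative read moves the buffer's position, so anything computed from the same buffer afterwards sees no bytes unless the position is restored with mark()/reset() or the read is done on a duplicate().

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ConsumedBufferDemo {

    // Hypothetical stand-in for a partitioner's token function (NOT Cassandra's code).
    // Assumption: an empty buffer maps to -1; otherwise the token is |MD5(remaining bytes)|.
    static BigInteger tokenFor(ByteBuffer key) {
        if (key.remaining() == 0) {
            return BigInteger.valueOf(-1);
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Hash a duplicate so this helper never moves the caller's position.
            md5.update(key.duplicate());
            return new BigInteger(md5.digest()).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical reader that consumes the buffer with a relative bulk get.
    static String readKey(ByteBuffer key) {
        byte[] bytes = new byte[key.remaining()];
        key.get(bytes); // advances position up to limit
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer key = ByteBuffer.wrap("row-key-42".getBytes(StandardCharsets.UTF_8));
        BigInteger expected = tokenFor(key);

        // 1) Plain read: the buffer is left with nothing remaining.
        readKey(key);
        System.out.println("plain read   -> remaining=" + key.remaining()
                + ", token matches=" + tokenFor(key).equals(expected));

        key.rewind(); // put the position back at 0 for the next case

        // 2) mark()/reset(): remember the position, read, then restore it.
        key.mark();
        readKey(key);
        key.reset();
        System.out.println("mark/reset   -> remaining=" + key.remaining()
                + ", token matches=" + tokenFor(key).equals(expected));

        // 3) duplicate(): read through an independent position/limit view.
        readKey(key.duplicate());
        System.out.println("duplicate()  -> remaining=" + key.remaining()
                + ", token matches=" + tokenFor(key).equals(expected));
    }
}
```

Reading through duplicate() avoids touching the shared buffer's state at all, which may be preferable when several pieces of code hold a reference to the same key buffer; mark()/reset() works too, as long as nothing in between discards the mark.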