Posted to user@cassandra.apache.org by Joost Ouwerkerk <jo...@openplaces.org> on 2010/04/18 21:59:38 UTC

Help with MapReduce

I'm a Cassandra noob trying to validate Cassandra as a viable alternative to
HBase (which we've been using for over a year) for our application.  So far,
I've had no success getting Cassandra working with MapReduce.

My first step is inserting data into Cassandra.  I've created a MapRed job
based on the fat client API.  I'm using the fat client (StorageProxy)
because that's what ColumnFamilyInputFormat uses and I want to use the same
API for both read and write jobs.

When I call StorageProxy.mutate(), nothing happens.  The job completes as if
it had done something, but in fact nothing has changed in the cluster.  When
I call StorageProxy.mutateBlocking(), I get an IOException complaining that
there is no connection to the cluster.  I've concluded with the debugger
that StorageService is not connecting to the cluster, even though I've
specified the correct seed and ListenAddress (I'm using the exact same
storage-conf.xml as the nodes in the cluster).

I'm sure I'm missing something obvious in the configuration or my setup, but
since I'm new to Cassandra, I can't see what it is.

Any help appreciated,
Joost

Re: Help with MapReduce

Posted by Jonathan Ellis <jb...@gmail.com>.
http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts
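
The settings that FAQ entry points at live in storage-conf.xml in 0.6.  A
minimal sketch of the kind of adjustment involved, with illustrative values
rather than tuned recommendations (lowering the memtable thresholds makes
flushes happen sooner, so less unflushed data sits in a small heap):

    <!-- Flush memtables to disk earlier so a 1G heap holds less
         unflushed data at any one time. -->
    <MemtableThroughputInMB>32</MemtableThroughputInMB>
    <MemtableOperationsInMillions>0.1</MemtableOperationsInMillions>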

On Tue, Apr 20, 2010 at 12:48 AM, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> Ok.  This should be ok for now, although not optimal for some jobs.
>
> Next issue is node stability during the insert job.  The stacktrace below
> occurred on several nodes while inserting 10 million rows.  We're running on
> 4G machines, 1G of which is allocated to cassandra.  What's the best config
> to prevent OOMs (even if it means sacrificing some performance)?

Re: Help with MapReduce

Posted by Joost Ouwerkerk <jo...@openplaces.org>.
Ok.  This should be ok for now, although not optimal for some jobs.

Next issue is node stability during the insert job.  The stacktrace below
occurred on several nodes while inserting 10 million rows.  We're running on
4G machines, 1G of which is allocated to cassandra.  What's the best config
to prevent OOMs (even if it means sacrificing some performance)?

ERROR [COMPACTION-POOL:1] 2010-04-20 01:39:15,853 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
	at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2786)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
	at java.io.DataOutputStream.write(DataOutputStream.java:90)
	at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
	at org.apache.cassandra.db.ColumnSerializer.writeName(ColumnSerializer.java:39)
	at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:301)
	at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)
	at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
	at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:131)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:41)
	at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
	at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
	at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
	at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:284)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:83)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	... 2 more


On Mon, Apr 19, 2010 at 10:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> Oh, from Hadoop.  Yes, you are indeed limited to entire columns or
> supercolumns at a time there.
>

Re: Help with MapReduce

Posted by Jonathan Ellis <jb...@gmail.com>.
yes

On 4/19/10, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> And when retrieving only one supercolumn?  Can I further specify which
> subcolumns to retrieve?
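
Concretely, through the 0.6 Thrift API the supercolumn is named in the
ColumnParent and the subcolumns in the SlicePredicate.  A minimal sketch,
assuming an already-connected Cassandra.Client; the keyspace, row key, and
all column names below are placeholders:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.cassandra.thrift.*;

    // Name the supercolumn in the parent...
    ColumnParent parent = new ColumnParent("Super1");
    parent.setSuper_column("mysuper".getBytes());
    // ...and pick out individual subcolumns with the predicate.
    SlicePredicate predicate = new SlicePredicate();
    predicate.setColumn_names(Arrays.asList("sub1".getBytes(), "sub2".getBytes()));
    List<ColumnOrSuperColumn> subcolumns =
        client.get_slice("Keyspace1", "rowkey", parent, predicate, ConsistencyLevel.ONE);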

Re: Help with MapReduce

Posted by Joost Ouwerkerk <jo...@openplaces.org>.
And when retrieving only one supercolumn?  Can I further specify which
subcolumns to retrieve?

On Mon, Apr 19, 2010 at 9:29 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> the latter, if you are retrieving multiple supercolumns.

Re: Help with MapReduce

Posted by Jonathan Ellis <jb...@gmail.com>.
the latter, if you are retrieving multiple supercolumns.

On Mon, Apr 19, 2010 at 8:10 PM, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> hmm, might be too much data.  In the case of a supercolumn, how do I specify
> which sub-columns to retrieve?  Or can I only retrieve entire supercolumns?

Re: Help with MapReduce

Posted by Joost Ouwerkerk <jo...@openplaces.org>.
hmm, might be too much data.  In the case of a supercolumn, how do I specify
which sub-columns to retrieve?  Or can I only retrieve entire supercolumns?

On Mon, Apr 19, 2010 at 8:47 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> Possibly you are asking it to retrieve too many columns per row.
>
> Possibly there is something else causing poor performance, like swapping.

Re: Help with MapReduce

Posted by Jonathan Ellis <jb...@gmail.com>.
Possibly you are asking it to retrieve too many columns per row.

Possibly there is something else causing poor performance, like swapping.

On Mon, Apr 19, 2010 at 7:12 PM, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> I'm slowly getting somewhere with Cassandra... I have successfully imported
> 1.5 million rows using MapReduce.  This took about 8 minutes on an 8-node
> cluster, which is comparable to the time it takes with HBase.
> Now I'm having trouble scanning this data.  I've created a simple MapReduce
> job that counts rows in my ColumnFamily.  The Job fails with most tasks
> throwing the following Exception.  Anyone have any ideas what's going wrong?

Re: Help with MapReduce

Posted by Jesse McConnell <je...@gmail.com>.
err not count in your case, but same symptom, cassandra can't return
the answer to your query in the configured rpctimeout time

cheers,
jesse

--
jesse mcconnell
jesse.mcconnell@gmail.com



On Mon, Apr 19, 2010 at 19:40, Jesse McConnell
<je...@gmail.com> wrote:
> most likely means that the count() operation is taking too long for
> the configured RPCTimeout
>
> counts get unreliable after a certain number of columns under a key in
> my experience
>
> jesse

Re: Help with MapReduce

Posted by Jesse McConnell <je...@gmail.com>.
most likely means that the count() operation is taking too long for
the configured RPCTimeout

counts get unreliable after a certain number of columns under a key in
my experience

jesse

--
jesse mcconnell
jesse.mcconnell@gmail.com
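
For reference, that timeout is configured in storage-conf.xml in 0.6.
Raising it buys headroom, though asking for fewer columns per call is the
real fix; the 30 seconds below is purely illustrative (the sample config
ships with 10000):

    <!-- How long the coordinator waits for replicas before reporting
         TimedOutException back to the client, in milliseconds. -->
    <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>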



On Mon, Apr 19, 2010 at 19:12, Joost Ouwerkerk <jo...@openplaces.org> wrote:
> I'm slowly getting somewhere with Cassandra... I have successfully imported
> 1.5 million rows using MapReduce.  This took about 8 minutes on an 8-node
> cluster, which is comparable to the time it takes with HBase.
> Now I'm having trouble scanning this data.  I've created a simple MapReduce
> job that counts rows in my ColumnFamily.  The Job fails with most tasks
> throwing the following Exception.  Anyone have any ideas what's going wrong?

Re: Help with MapReduce

Posted by Joost Ouwerkerk <jo...@openplaces.org>.
I'm slowly getting somewhere with Cassandra... I have successfully imported
1.5 million rows using MapReduce.  This took about 8 minutes on an 8-node
cluster, which is comparable to the time it takes with HBase.

Now I'm having trouble scanning this data.  I've created a simple MapReduce
job that counts rows in my ColumnFamily.  The Job fails with most tasks
throwing the following Exception.  Anyone have any ideas what's going wrong?

java.lang.RuntimeException: TimedOutException()

	at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
	at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
	at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
	at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
	at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: TimedOutException()
	at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
	at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
	at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
	at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
	... 11 more
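
A minimal sketch of that kind of scan job against the 0.6 input format,
following the contrib/word_count conventions (method names as in the
0.6-era ConfigHelper, and assuming, as that example does, that
storage-conf.xml is on the task classpath so the input format can locate
the cluster; keyspace and column family names are placeholders).  The
narrow SliceRange count is the part that matters for the timeout, since
each underlying get_range_slices call has to finish inside
RpcTimeoutInMillis:

    import java.io.IOException;
    import java.util.SortedMap;
    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RowCount {
        // Emit one count per input row; the reducer sums them up.
        static class RowMapper
                extends Mapper<String, SortedMap<byte[], IColumn>, Text, LongWritable> {
            protected void map(String key, SortedMap<byte[], IColumn> columns,
                               Context context) throws IOException, InterruptedException {
                context.write(new Text("rows"), new LongWritable(1));
            }
        }

        static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            protected void reduce(Text key, Iterable<LongWritable> values,
                                  Context context) throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable v : values) total += v.get();
                context.write(key, new LongWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(RowCount.class);
            job.setMapperClass(RowMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path("rowcount-output"));
            ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
            // Counting rows doesn't need whole rows: a small slice keeps
            // each server-side range scan fast enough to beat the timeout.
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 100));
            ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
            job.waitForCompletion(true);
        }
    }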


On Sun, Apr 18, 2010 at 6:01 PM, Stu Hood <st...@rackspace.com> wrote:

> In 0.6.0 and trunk, it is located at
> src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java
>
> You might be using a pre-release version of 0.6 if you are seeing a fat
> client based InputFormat.

Re: Help with MapReduce

Posted by Stu Hood <st...@rackspace.com>.
In 0.6.0 and trunk, it is located at src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java

You might be using a pre-release version of 0.6 if you are seeing a fat client based InputFormat.


-----Original Message-----
From: "Joost Ouwerkerk" <jo...@openplaces.org>
Sent: Sunday, April 18, 2010 4:53pm
To: user@cassandra.apache.org
Subject: Re: Help with MapReduce

Where is the ColumnFamilyInputFormat that uses Thrift?  I don't actually
have a preference about client, I just want to be consistent with
ColumnInputFormat.

Re: Help with MapReduce

Posted by Joost Ouwerkerk <jo...@openplaces.org>.
Where is the ColumnFamilyInputFormat that uses Thrift?  I don't actually
have a preference about client, I just want to be consistent with
ColumnInputFormat.

On Sun, Apr 18, 2010 at 5:37 PM, Stu Hood <st...@rackspace.com> wrote:

> ColumnFamilyInputFormat no longer uses the fat client API, and instead uses
> Thrift. There are still some significant problems with the fat client, so it
> shouldn't be used without a good understanding of those problems.
>
> If you still want to use it, check out contrib/bmt_example, but I'd
> recommend that you use thrift for now.

RE: Help with MapReduce

Posted by Stu Hood <st...@rackspace.com>.
ColumnFamilyInputFormat no longer uses the fat client API, and instead uses Thrift. There are still some significant problems with the fat client, so it shouldn't be used without a good understanding of those problems.

If you still want to use it, check out contrib/bmt_example, but I'd recommend that you use thrift for now.
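
For what it's worth, a minimal sketch of the Thrift write path in 0.6
(host, keyspace, column family, and all keys and values below are
placeholders; a real MapReduce task would open one connection and reuse it
across rows rather than one per insert):

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class ThriftInsertExample {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TSocket("cassandra-host", 9160);
            transport.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

            // Timestamps are supplied by the client; microseconds since
            // the epoch is the convention.
            long timestamp = System.currentTimeMillis() * 1000;
            ColumnPath path = new ColumnPath("Standard1");
            path.setColumn("name".getBytes());
            client.insert("Keyspace1", "rowkey", path, "value".getBytes(),
                          timestamp, ConsistencyLevel.QUORUM);
            transport.close();
        }
    }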
