Posted to user@cassandra.apache.org by Dave Martin <mo...@googlemail.com> on 2010/12/12 08:26:38 UTC

OutOfMemory on count on cassandra 0.6.8 for large number of columns

Hi there,

I see the following:

1) Add 8,000,000 columns to a single row. Each column name is a UUID.
2) Use cassandra-cli to run count keyspace.cf['myGUID']
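
For reference, the cli count boils down to the Thrift get_count call
named in the errors below. A rough, untested sketch of the equivalent
direct call against the 0.6 Thrift API (where get_count takes the
keyspace as its first argument):

import org.apache.cassandra.thrift.{Cassandra, ColumnParent, ConsistencyLevel}
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TSocket

object CountTest {
  def main(args: Array[String]): Unit = {
    val socket = new TSocket("localhost", 9160)
    val client = new Cassandra.Client(new TBinaryProtocol(socket))
    socket.open()
    // The server deserializes every column in the row to produce this count.
    val n = client.get_count("occurrence", "myGUID",
      new ColumnParent("dr"), ConsistencyLevel.ONE)
    println("columns: " + n)
    socket.close()
  }
}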

The following is reported in the logs:

ERROR [DroppedMessagesLogger] 2010-12-12 18:17:36,046 CassandraDaemon.java (line 87) Uncaught exception in thread Thread[DroppedMessagesLogger,5,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [pool-1-thread-2] 2010-12-12 18:17:36,046 Cassandra.java (line 1407) Internal error processing get_count
java.lang.OutOfMemoryError: Java heap space

and Cassandra falls over. I see the same behaviour with 0.6.6.

Increasing the memory allocation with the -Xmx & -Xms args to 4GB allows the count to return in this particular example (i.e. no OutOfMemory is thrown).

Here's the Scala code that was run to load the columns (it uses the Akka persistence API):

// Imports added for completeness: java.util.UUID and the 0.6 Thrift
// classes are standard; the Akka persistence packages are assumed from
// the Akka releases of the time (se.scalablesolutions.akka.*).
import java.util.UUID
import org.apache.cassandra.thrift.{ColumnPath, ConsistencyLevel}
import se.scalablesolutions.akka.persistence.cassandra._
import se.scalablesolutions.akka.persistence.common._

object ColumnTest {
  def main(args: Array[String]): Unit = {
    println("Column test starting")
    val sessions = new CassandraSessionPool("occurrence",
      StackPool(SocketProvider("localhost", 9160)),
      Protocol.Binary, ConsistencyLevel.ONE)
    val session = sessions.newSession
    loadRow("myGUID", 8000000, session)
    session.close
  }

  def loadRow(key: String, noOfColumns: Int, session: CassandraSession) {
    print("loading: " + key + ", with columns: " + noOfColumns)
    val start = System.currentTimeMillis
    val rawPath = new ColumnPath("dr")
    for (i <- 0 until noOfColumns) {
      // One column per random UUID name, with a one-byte value of "1".
      val recordUuid = UUID.randomUUID.toString
      session ++| (key, rawPath.setColumn(recordUuid.getBytes),
        "1".getBytes, System.currentTimeMillis)
      session.flush
    }
    val finish = System.currentTimeMillis
    print(", time taken: " + ((finish - start) / 1000) + " seconds.\n")
  }
}

Here's the configuration used:

# Arguments to pass to the JVM
JVM_OPTS=" \
        -ea \
        -Xms1G \
        -Xmx2G \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:SurvivorRatio=8 \
        -XX:MaxTenuringThreshold=1 \
        -XX:CMSInitiatingOccupancyFraction=75 \
        -XX:+UseCMSInitiatingOccupancyOnly \
        -XX:+HeapDumpOnOutOfMemoryError \
        -Dcom.sun.management.jmxremote.port=8080 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"

Admittedly the resource allocation is small, but I wondered if there should be some configuration guidelines (e.g. memory allocation vs number of columns supported).
        
I'm running this on my MBP with a single node; Java version as follows:

$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
        
Here's the CF definition:

    <Keyspace Name="occurrence">
      <ColumnFamily Name="dr"
                    CompareWith="UTF8Type"
                    Comment="The column family for dataset tracking"/>
     <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
     <ReplicationFactor>1</ReplicationFactor>
     <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>
    
Apologies in advance if this is a known issue or a known limitation of 0.6.x.
I had wondered if I was hitting the 2GB row limit of the 0.6.x releases, but
8 million columns come to roughly 300MB in this particular case (raw names
and values only: 8,000,000 x (36-byte UUID name + 1-byte value) = ~296MB,
before per-column overhead).
I guess it may also be a result of Thrift's limitations (i.e. no streaming
capabilities).
    
Any thoughts appreciated,

Dave

Re: Unsubscribe

Posted by Peter Schuller <pe...@infidyne.com>.
> Unsubscribe

http://wiki.apache.org/cassandra/FAQ#unsubscribe


-- 
/ Peter Schuller

Unsubscribe

Posted by Colin <co...@cloudeventprocessing.com>.
Unsubscribe

Please

Sent from my iPad

Re: OutOfMemory on count on cassandra 0.6.8 for large number of columns

Posted by Tyler Hobbs <ty...@riptano.com>.
Well, in this case I would say you probably need about 300MB of space in the
heap, since that's what you've calculated.

The APIs are designed to let you do what you think is best, and they
definitely won't stop you from shooting yourself in the foot. Counting a
huge row, or trying to grab every row in a large column family, are examples
of this. Some of the clients try to protect you from these mistakes, but
there is only so much that can be done without specific knowledge of the
data, and get_count() is one such case.

While we're on the topic of large rows, if your row is essentially unbounded
in size, you need to consider splitting it. This is especially true if you
stay with 0.6, where compactions of large rows can OOM you pretty easily.
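
For example, one common scheme is to shard the logical row across a
fixed number of physical rows by bucketing on the column name. A
minimal sketch (the bucket count and key format are illustrative, not
something from your schema):

// Shard one logical row across a fixed number of physical rows.
val buckets = 16

def physicalKey(logicalKey: String, columnName: String): String =
  // The mask keeps the hash non-negative (hashCode can be Int.MinValue).
  logicalKey + ":" + ((columnName.hashCode & Int.MaxValue) % buckets)

// Writes for logical row "myGUID" then land on myGUID:0 .. myGUID:15,
// and a full count becomes the sum of sixteen much smaller per-row
// counts.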

- Tyler

Re: OutOfMemory on count on cassandra 0.6.8 for large number of columns

Posted by Dave Martin <mo...@googlemail.com>.
Thanks Tyler. I was unaware of counters.

The use case for column counts is really an operational one: to allow a
sysadmin to run ad hoc checks on columns, to see whether something has
gone wrong in software outside of Cassandra.

I think it's not ideal that a cassandra-cli command such as count can make
Cassandra fall over, unless we can say that for X columns Cassandra needs
at least Y memory to stay stable.

Cheers

Dave


Re: OutOfMemory on count on cassandra 0.6.8 for large number of columns

Posted by Tyler Hobbs <ty...@riptano.com>.
Cassandra has to deserialize all of the columns in the row for get_count().
So from Cassandra's perspective, it's almost as much work as getting the
entire row, it just doesn't have to send everything back over the network.

If you're frequently counting 8 million columns (or really, anything
significant), you need to use counters instead.  If this is a rare
occurrence, you can do the count in multiple chunks by using a starting and
ending column in the SlicePredicate for each chunk, but this requires some
rough knowledge about the distribution of the column names in the row.
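
In 0.6, where get_count takes no predicate, the chunking goes through
get_slice instead, paging on the last column name of each chunk. A
rough, untested sketch against the 0.6 Thrift API, reusing your
keyspace and CF names with an illustrative page size:

import org.apache.cassandra.thrift._
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TSocket

object ChunkedCount {
  def main(args: Array[String]): Unit = {
    val socket = new TSocket("localhost", 9160)
    val client = new Cassandra.Client(new TBinaryProtocol(socket))
    socket.open()

    val parent = new ColumnParent("dr")
    val pageSize = 10000
    var start: Array[Byte] = Array()  // empty start = beginning of the row
    var total = 0L
    var done = false
    while (!done) {
      val pred = new SlicePredicate().setSlice_range(
        new SliceRange(start, Array[Byte](), false, pageSize))
      val cols = client.get_slice("occurrence", "myGUID", parent, pred,
        ConsistencyLevel.ONE)
      // Each page after the first begins with the previous page's last
      // column, so avoid counting that column twice.
      total += (if (start.length == 0) cols.size else cols.size - 1)
      if (cols.size < pageSize) done = true
      else start = cols.get(cols.size - 1).getColumn.getName
    }
    println("total columns: " + total)
    socket.close()
  }
}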

- Tyler
