Posted to user@cassandra.apache.org by Paul Prescod <pa...@prescod.net> on 2010/04/07 12:51:55 UTC

OrderPreservingPartitioner limits and workarounds

I have one append-oriented workload and I would like to know if
Cassandra is appropriate for it.

Given:

 * 100 nodes

 * an OrderPreservingPartitioner

 * a replication factor of "3"

 * a write-pattern of "always append"

 * a strong requirement for range queries

My understanding is that 3 nodes will end up being responsible for
all writes and potentially a disproportionate amount of reads (in
the common case that users care more about recent data than older
data).

Is there some manual way I can fiddle with InitialTokens and
ReplicationFactors to share the load more fairly?

 Paul Prescod

Re: OrderPreservingPartitioner limits and workarounds

Posted by Mark Robson <ma...@gmail.com>.
On 7 April 2010 19:13, Jonathan Ellis <jb...@gmail.com> wrote:

> One thing you can do is manually "randomize" keys for any CFs that
> don't need the OP by pre-pending their md5 to the key you send
> Cassandra.  (This is all RP is doing under the hood anyway.)
>

Another possibility is to prefix the keys with a hash of some component
that you don't need to range scan on.

For example, if you have thousands of customers, but they individually want
to do range scans, then you can hash the customer ID and put that at the
beginning (I use a 16-bit hex hash, which gives enough distribution for a
sane number of nodes).

Then you'll tend to get keys which start with 0000 - ffff followed by
whatever your increasing key is (timestamp etc). Workloads should tend to
balance out but will get a bit patchy if you have, for example, a small
number of disproportionately huge customers.
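
As a sketch of that layout (the helper and key format below are
illustrative, not something Cassandra provides): take the first 16 bits of
an md5 of the customer ID as a 4-character hex prefix, then append the
naturally increasing part of the key. Range scans within one customer still
work, because all of that customer's keys share the same prefix.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class PrefixedKeys {
        /** Build a key like "3fa7:customer-42:<increasing part>". */
        static String prefixedKey(String customerId, String increasingPart) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(customerId.getBytes(StandardCharsets.UTF_8));
            // First 16 bits of the digest as a hex prefix in 0000-ffff.
            String prefix = String.format("%02x%02x", d[0] & 0xff, d[1] & 0xff);
            return prefix + ":" + customerId + ":" + increasingPart;
        }
    }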

Mark

Re: Why can't you manage one node from another?

Posted by Jonathan Ellis <jb...@gmail.com>.
looks like you are running into http://wiki.apache.org/cassandra/JmxGotchas

On Wed, Apr 7, 2010 at 2:21 PM, Mark Jones <MJ...@imagehawk.com> wrote:
> I have 3 nodes in the cluster, and
>    bin/nodetool --host this-host-name ring
> Works as expected, but
>    bin/nodetool --host some-other-host ring
>
> always throws this exception:
>
> Error connecting to remote JMX agent!
> java.rmi.ConnectException: Connection refused to host: 127.0.1.1; nested exception is:
>        java.net.ConnectException: Connection refused
>        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
>        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
>        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
>        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:128)
>        at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source)
>        at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2343)
>        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:296)
>        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:267)
>        at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:105)
>        at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:81)
>        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:404)
> Caused by: java.net.ConnectException: Connection refused
>        at java.net.PlainSocketImpl.socketConnect(Native Method)
>        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
>        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
>        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
>        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
>        at java.net.Socket.connect(Socket.java:542)
>        at java.net.Socket.connect(Socket.java:492)
>        at java.net.Socket.<init>(Socket.java:389)
>        at java.net.Socket.<init>(Socket.java:203)
>        at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40)
>        at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:146)
>        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613)
>        ... 10 more
>
>
> My hosts file looks like:
>
> 127.0.0.1       localhost
> 127.0.1.1       ec1
>
> # The following lines are desirable for IPv6 capable hosts
> ::1     localhost ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>

Why can't you manage one node from another?

Posted by Mark Jones <MJ...@imagehawk.com>.
I have 3 nodes in the cluster, and
    bin/nodetool --host this-host-name ring
Works as expected, but
    bin/nodetool --host some-other-host ring

always throws this exception:

Error connecting to remote JMX agent!
java.rmi.ConnectException: Connection refused to host: 127.0.1.1; nested exception is:
        java.net.ConnectException: Connection refused
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
        at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:128)
        at javax.management.remote.rmi.RMIServerImpl_Stub.newClient(Unknown Source)
        at javax.management.remote.rmi.RMIConnector.getConnection(RMIConnector.java:2343)
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:296)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:267)
        at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:105)
        at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:81)
        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:404)
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
        at java.net.Socket.connect(Socket.java:542)
        at java.net.Socket.connect(Socket.java:492)
        at java.net.Socket.<init>(Socket.java:389)
        at java.net.Socket.<init>(Socket.java:203)
        at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40)
        at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:146)
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613)
        ... 10 more


My hosts file looks like:

127.0.0.1       localhost
127.0.1.1       ec1

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Re: OrderPreservingPartitioner limits and workarounds

Posted by Jonathan Ellis <jb...@gmail.com>.
One thing you can do is manually "randomize" keys for any CFs that
don't need the OP by pre-pending their md5 to the key you send
Cassandra.  (This is all RP is doing under the hood anyway.)
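
A minimal sketch of that manual randomization (the key layout with the ':'
separator is my choice, not anything Cassandra imposes):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class RandomizedKeys {
        /** Prepend the md5 of the application key, e.g. "0cc175b9...:mykey". */
        static String randomize(String key) {
            try {
                byte[] digest = MessageDigest.getInstance("MD5")
                        .digest(key.getBytes(StandardCharsets.UTF_8));
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) {
                    hex.append(String.format("%02x", b & 0xff));
                }
                // Keys now sort by their md5, i.e. effectively at random,
                // so writes spread evenly even under OPP.
                return hex.toString() + ":" + key;
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("MD5 is always available", e);
            }
        }
    }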

On Wed, Apr 7, 2010 at 5:51 AM, Paul Prescod <pa...@prescod.net> wrote:
> I have one append-oriented workload and I would like to know if
> Cassandra is appropriate for it.
>
> Given:
>
>  * 100 nodes
>
>  * an OrderPreservingPartitioner
>
>  * a replication factor of "3"
>
>  * a write-pattern of "always append"
>
>  * a strong requirement for range queries
>
> My understanding is that 3 nodes will end up being responsible for
> all writes and potentially a disproportionate amount of reads (in
> the common case that users care more about recent data than older
> data).
>
> Is there some manual way I can fiddle with InitialTokens and
> ReplicationFactors to share the load more fairly?
>
>  Paul Prescod
>

Re: Can these stats be right?

Posted by Rob Coli <rc...@digg.com>.
On 4/7/10 12:16 PM, Mark Jones wrote:
>                  Read Latency: NaN ms.

./trunk/src/java/org/apache/cassandra/tools/NodeCmd.java:

    outs.println("\t\tRead Latency: " + String.format("%01.3f",
            cfstore.getRecentReadLatencyMicros() / 1000) + " ms.");

This call is telling you the (Read|Write|Range)Latency since the last 
time it was sampled. The other measure of latency is a value since the 
node started, exposed via the JMX interface as 
"org.apache.cassandra.service.StorageProxy.Attributes.TotalRangeLatencyMicros". 


As long as you don't restart your node while doing your testing, this
should provide the latency number you are looking for.

http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.service.StorageProxy.Attributes.TotalRangeLatencyMicros

The JmxInterface wiki page is currently missing a section on the per-CF 
attributes, but as cfstats shows, they are available via the JMX 
interface as well.
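
For reference, a minimal sketch of pulling that attribute over JMX. The
port (8080 was the 0.6 default), the ObjectName, and the attribute name
below are my best guess based on the wiki path above, so verify them in
jconsole against your version first:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RangeLatencyProbe {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // ObjectName and attribute assumed from the wiki page cited above.
                ObjectName storageProxy =
                        new ObjectName("org.apache.cassandra.service:type=StorageProxy");
                Object micros = mbs.getAttribute(storageProxy, "TotalRangeLatencyMicros");
                System.out.println("TotalRangeLatencyMicros = " + micros);
            }
        }
    }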

=Rob

Can these stats be right?

Posted by Mark Jones <MJ...@imagehawk.com>.
From cfstats:
                SSTable count: 3
                Space used (live): 4951669191
                Space used (total): 5237040637
                Memtable Columns Count: 190266
                Memtable Data Size: 23459012
                Memtable Switch Count: 89
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 13307292
                Write Latency: 0.045 ms.
                Pending Tasks: 0
                Key cache capacity: 200000
                Key cache size: 0
                Key cache hit rate: NaN
                Row cache: disabled
                Compacted row minimum size: 446
                Compacted row maximum size: 111524663
                Compacted row mean size: 880

A Write Latency of 0.045 ms should translate into > 22,000 writes per second, yet my 80 threads pumping data at this node barely reach 6,000/second, and the machine sending the writes is 97% idle.

I have 3 nodes, with a replication factor of 2.

Re: OrderPreservingPartitioner limits and workarounds

Posted by Paul Prescod <pa...@prescod.net>.
Since I wrote that at 3:51 AM (my time), I have come to many of the same
conclusions and decided to write them up, to try to provide a high-level
guide on sorting and ordering.

 * http://jottit.com/s8c4a/

But for completeness I was still hoping to document any workarounds
that would help mitigate load balancing issues with the OPP.

On Wed, Apr 7, 2010 at 10:46 AM, Benjamin Black <b...@b3k.us> wrote:
> I'd suggest you use RandomPartitioner, an index, and multiget.  You'll
> be able to do range queries and won't have the load imbalance and
> performance problems of OPP and native range queries.
>
>
> b
>
> On Wed, Apr 7, 2010 at 3:51 AM, Paul Prescod <pa...@prescod.net> wrote:
>> I have one append-oriented workload and I would like to know if
>> Cassandra is appropriate for it.
>>
>> Given:
>>
>>  * 100 nodes
>>
>>  * an OrderPreservingPartitioner
>>
>>  * a replication factor of "3"
>>
>>  * a write-pattern of "always append"
>>
>>  * a strong requirement for range queries
>>
>> My understanding is that 3 nodes will end up being responsible for
>> all writes and potentially a disproportionate amount of reads (in
>> the common case that users care more about recent data than older
>> data).
>>
>> Is there some manual way I can fiddle with InitialTokens and
>> ReplicationFactors to share the load more fairly?
>>
>>  Paul Prescod
>>
>

Re: OrderPreservingPartitioner limits and workarounds

Posted by Benjamin Black <b...@b3k.us>.
I'd suggest you use RandomPartitioner, an index, and multiget.  You'll
be able to do range queries and won't have the load imbalance and
performance problems of OPP and native range queries.
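
A rough sketch of that index-plus-multiget pattern: keep one index row whose
column names are the sortable keys, slice it for the desired range, then
multiget the referenced data rows. The CassandraClient interface below is a
stand-in for whatever client you use; its method names, and the "Events" /
"EventIndex" column families, are illustrative only.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.SortedMap;

    // Hypothetical thin wrapper over your Cassandra client; not a real API.
    interface CassandraClient {
        /** Columns of one row between start and end, in column-name order. */
        SortedMap<String, byte[]> getSlice(String columnFamily, String rowKey,
                                           String start, String end);
        /** Fetch several rows at once by key. */
        Map<String, Map<String, byte[]>> multiget(String columnFamily, List<String> rowKeys);
    }

    public class IndexedRangeQuery {
        private final CassandraClient client;

        IndexedRangeQuery(CassandraClient client) {
            this.client = client;
        }

        /**
         * Data rows live in "Events" under RandomPartitioner.  A single index row
         * in "EventIndex" has one column per event, named by its sortable key
         * (e.g. a timestamp), so a column slice of the index yields an ordered range.
         */
        Map<String, Map<String, byte[]>> rangeQuery(String startTs, String endTs) {
            SortedMap<String, byte[]> index =
                    client.getSlice("EventIndex", "by_time", startTs, endTs);
            List<String> dataKeys = new ArrayList<>();
            for (byte[] value : index.values()) {
                // Each index column value holds the data row key to fetch.
                dataKeys.add(new String(value, StandardCharsets.UTF_8));
            }
            return client.multiget("Events", dataKeys);
        }
    }

The trade-off is that the index row itself gets wide and its replicas take
every index write, so in practice you would probably shard it (say, one
index row per day or per customer).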


b

On Wed, Apr 7, 2010 at 3:51 AM, Paul Prescod <pa...@prescod.net> wrote:
> I have one append-oriented workload and I would like to know if
> Cassandra is appropriate for it.
>
> Given:
>
>  * 100 nodes
>
>  * an OrderPreservingPartitioner
>
>  * a replication factor of "3"
>
>  * a write-pattern of "always append"
>
>  * a strong requirement for range queries
>
> My understanding is that 3 nodes will end up being responsible for
> all writes and potentially a disproportionate amount of reads (in
> the common case that users care more about recent data than older
> data).
>
> Is there some manual way I can fiddle with InitialTokens and
> ReplicationFactors to share the load more fairly?
>
>  Paul Prescod
>

RE: OrderPreservingPartitioner limits and workarounds

Posted by Mark Jones <MJ...@imagehawk.com>.
Sounds like you want something like http://oss.oetiker.ch/rrdtool/

Assuming you are trying to store computer log data.

Do you have any other data that can spread the load, like a machine name?  If so, you can use a hash of that value to place that "machine" randomly on the ring, then append the timestamp; this keeps each machine's data grouped together on the ring (assuming you don't have a massive # of writes about each machine at one time).
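
A small sketch of that key layout (the helper and format are mine). The one
extra detail worth noting is zero-padding the timestamp, since the
order-preserving partitioner orders keys as strings and variable-width
numbers would not sort chronologically:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class MachineLogKeys {
        /** e.g. "9c3e:webserver-07:00000001270641115000" */
        static String logKey(String machineName, long epochMillis) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(machineName.getBytes(StandardCharsets.UTF_8));
            String prefix = String.format("%02x%02x", d[0] & 0xff, d[1] & 0xff);
            // Fixed-width timestamp keeps string order the same as time order.
            return prefix + ":" + machineName + ":" + String.format("%020d", epochMillis);
        }
    }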

-----Original Message-----
From: prescod@gmail.com [mailto:prescod@gmail.com] On Behalf Of Paul Prescod
Sent: Wednesday, April 07, 2010 5:52 AM
To: cassandra user
Subject: OrderPreservingPartitioner limits and workarounds

I have one append-oriented workload and I would like to know if
Cassandra is appropriate for it.

Given:

 * 100 nodes

 * an OrderPreservingPartitioner

 * a replication factor of "3"

 * a write-pattern of "always append"

 * a strong requirement for range queries

My understanding is that 3 nodes will end up being responsible for
all writes and potentially a disproportionate amount of reads (in
the common case that users care more about recent data than older
data).

Is there some manual way I can fiddle with InitialTokens and
ReplicationFactors to share the load more fairly?

 Paul Prescod