Posted to user@cassandra.apache.org by Markus Klems <ma...@gmail.com> on 2013/06/13 19:47:05 UTC

Scaling a cassandra cluster with auto_bootstrap set to false

Hi Cassandra community,

We are currently experimenting with different Cassandra scaling
strategies. We observed that Cassandra performance decreases
drastically when we insert more data into the cluster (say, going from
60GB to 600GB in a 3-node cluster). So we want to find out how to deal
with this problem. One scaling strategy seems interesting but we don't
fully understand what is going on, yet. The strategy works like this:
add new nodes to a Cassandra cluster with "auto_bootstrap = false" to
avoid streaming to the new nodes. We were a bit surprised that this
strategy improved performance considerably and that it worked much
better than other strategies that we tried before, both in terms of
scaling speed and performance impact during scaling.

Let me share our little experiment with you:

In an initial setup S1 we have 4 nodes, each similar to the Amazon EC2
large instance type, i.e., 4 cores, 15GB memory, 700GB free disk
space; the Cassandra replication factor is 2. Each node is loaded with
10 million 1KB rows in a single column family, i.e., ~20 GB data/node,
using the Yahoo Cloud Serving Benchmark (YCSB) tool. All Cassandra
settings are left at their defaults. In setup S1 we achieved an
average throughput of ~800 ops/s. The workload is a 95/5 read/update
mix with a Zipfian request distribution (= YCSB workload B).
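
For reference, here is a minimal sketch of how this setup maps onto
YCSB core properties (the class name and the host entry below are
placeholders; the property names are the stock YCSB ones, with 1KB
rows assumed to be the default 10 fields x 100 bytes):

import java.util.Properties;

// Illustrative sketch only: YCSB core properties matching the setup above
// (workload B = 95/5 read/update, Zipfian request distribution, ~1KB rows).
// The class name and the host entry are placeholders.
public class WorkloadBSettings
{
  public static Properties workloadBProperties()
  {
    Properties props = new Properties();
    props.setProperty("hosts", "cassandra-node-1");       // placeholder host
    props.setProperty("recordcount", "10000000");         // 10 million rows (see the load described above)
    props.setProperty("fieldcount", "10");                // YCSB default
    props.setProperty("fieldlength", "100");              // 10 x 100 bytes ~= 1KB/row
    props.setProperty("readproportion", "0.95");          // workload B: 95% reads
    props.setProperty("updateproportion", "0.05");        // workload B: 5% updates
    props.setProperty("requestdistribution", "zipfian");  // Zipfian key popularity
    return props;
  }
}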

Setup S2: We then added two empty nodes to our 4-node cluster with
auto_bootstrap set to false. The throughput we observed thereafter
tripled, from 800 ops/s to 2,400 ops/s. We looked at the output of
various nodetool commands to understand this effect. On the new nodes,
"$ nodetool info" tells us that the key cache is empty, while "$
nodetool cfstats" clearly shows write and read requests coming in. The
memtable column count and data size are several times larger than on
the other four nodes.

We are wondering: what exactly gets stored on the two new nodes in
setup S2 and where (cache, memtable, disk?). Would it be necessary (in
a production environment) to stream the old SSTables from the other
four nodes at some point in time? Or can we simply be happy with the
performance improvement and leave it like this? Are we missing
something here? Can you advise us which specific monitoring data to
look at to better understand the observed effect?

Thanks,

Markus Klems

Re: Scaling a cassandra cluster with auto_bootstrap set to false

Posted by Markus Klems <ma...@gmail.com>.
Robert,

thank you for your explanation. I think you are right. YCSB probably
does not correctly interpret the "missing record" response. We will
look into it and report our results here in the coming days.

Thanks,

Markus

On Thu, Jun 13, 2013 at 9:47 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Thu, Jun 13, 2013 at 10:47 AM, Markus Klems <ma...@gmail.com> wrote:
>> One scaling strategy seems interesting but we don't
>> fully understand what is going on, yet. The strategy works like this:
>> add new nodes to a Cassandra cluster with "auto_bootstrap = false" to
>> avoid streaming to the new nodes.
>
> If you set auto_bootstrap to false, new nodes take over responsibility
> for a range of the ring but do not receive the data for the range from
> the old nodes. If you read the new node at CL.ONE, you will get the
> answer that data you wrote to the old node does not exist, because the
> new node did not receive it as part of bootstrap. This is probably not
> what you expect.
>
>> We were a bit surprised that this
>> strategy improved performance considerably and that it worked much
>> better than other strategies that we tried before, both in terms of
>> scaling speed and performance impact during scaling.
>
> CL.ONE requests for rows which do not exist are very fast.
>
>> Would it be necessary (in a production environment) to stream the old SSTables from the other
>> four nodes at some point in time?
>
> Bootstrapping is necessary for consistency and durability, yes. If you were to:
>
> 1) start a new node without bootstrapping it
> 2) run "cleanup" compaction on the old node
>
> You would permanently delete the copy of the data that is no longer
> "supposed" to live on the old node. With an RF of 1, that data would be
> permanently gone. With an RF of >1 you have other copies, but if you
> never bootstrap while adding new nodes, over time you become
> increasingly likely to lose access to those copies.
>
> =Rob

Re: Scaling a cassandra cluster with auto_bootstrap set to false

Posted by Markus Klems <ma...@gmail.com>.
On Thu, Jun 13, 2013 at 11:20 PM, Edward Capriolo <ed...@gmail.com> wrote:
> CL.ONE requests for rows which do not exist are very fast.
>
> http://adrianotto.com/2010/08/dev-null-unlimited-scale/
>

Yep, /dev/null is a mighty force ;-)

I took a look at the YCSB source code and spotted the line of code
that caused our confusion: it is in the file
https://github.com/brianfrankcooper/YCSB/blob/master/core/src/main/java/com/yahoo/ycsb/workloads/CoreWorkload.java
in the "public boolean doTransaction(DB db, Object threadstate)"
method, at line 497. No matter what the result of a YCSB transaction
operation is, the method always returns "true". I am not sure this is
desirable behavior for a benchmarking tool; it makes it difficult to
spot this kind of mistake.
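
For illustration, here is a rough sketch (not the actual YCSB code,
and simplified to three operation types) of how the per-operation
status could be propagated instead of being discarded; it assumes
hypothetical doTransactionRead/Update/Insert helpers that return the
int status code of the underlying DB call (0 = OK in this YCSB
version):

  // Hypothetical sketch only -- not the real CoreWorkload implementation.
  // Assumes the doTransaction* helpers return the DB status code (0 = OK)
  // instead of being void, so that failures reach the caller.
  public boolean doTransaction(DB db, Object threadstate)
  {
    String op = operationchooser.nextString();
    int status;
    if (op.equals("READ"))
    {
      status = doTransactionRead(db);
    }
    else if (op.equals("UPDATE"))
    {
      status = doTransactionUpdate(db);
    }
    else
    {
      status = doTransactionInsert(db);
    }
    // Only report success when the underlying operation succeeded, so
    // that reads of rows that no replica actually holds show up as errors.
    return status == 0;
  }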

The problem can also be observed by running this piece of code:

// Standalone test class (the class name is arbitrary); requires the YCSB
// core and the Cassandra binding on the classpath.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Properties;

import com.yahoo.ycsb.ByteIterator;
import com.yahoo.ycsb.StringByteIterator;
import com.yahoo.ycsb.db.CassandraClient10;

public class ReturnCodeCheck
{
  public static void main(String[] args)
  {
    CassandraClient10 cli = new CassandraClient10();

    Properties props = new Properties();
    props.setProperty("hosts", args[0]);
    cli.setProperties(props);

    try
    {
      cli.init();
    }
    catch (Exception e)
    {
      e.printStackTrace();
      System.exit(0);
    }

    // Insert a row with three fields.
    HashMap<String, ByteIterator> vals = new HashMap<String, ByteIterator>();
    vals.put("age", new StringByteIterator("57"));
    vals.put("middlename", new StringByteIterator("bradley"));
    vals.put("favoritecolor", new StringByteIterator("blue"));
    int res = cli.insert("usertable", "BrianFrankCooper", vals);
    System.out.println("Result of insert: " + res);

    // Read the row back. Passing null for the field set reads all fields,
    // so the 'fields' set below is not actually used by the call.
    HashMap<String, ByteIterator> result = new HashMap<String, ByteIterator>();
    HashSet<String> fields = new HashSet<String>();
    fields.add("middlename");
    fields.add("age");
    fields.add("favoritecolor");
    res = cli.read("usertable", "BrianFrankCooper", null, result);
    System.out.println("Result of read: " + res);
    for (String s : result.keySet())
    {
      System.out.println("[" + s + "]=[" + result.get(s) + "]");
    }

    // Delete the row ...
    res = cli.delete("usertable", "BrianFrankCooper");
    System.out.println("Result of delete: " + res);

    // ... and read it again: the read still reports success ("0") even
    // though the row has just been deleted.
    res = cli.read("usertable", "BrianFrankCooper", null, result);
    System.out.println("Result of read: " + res);
    for (String s : result.keySet())
    {
      System.out.println("[" + s + "]=[" + result.get(s) + "]");
    }
  }
}

which results in:

Result of insert: 0
Result of read: 0
[middlename]=[bradley]
[favoritecolor]=[blue]
[age]=[57]
Result of delete: 0
Result of read: 0
[middlename]=[]
[favoritecolor]=[]
[age]=[]

The second read (after the delete) should not report success ("0").

@Robert & Edward, thanks for your help,

-Markus

Re: Scaling a cassandra cluster with auto_bootstrap set to false

Posted by Edward Capriolo <ed...@gmail.com>.
CL.ONE requests for rows which do not exist are very fast.

http://adrianotto.com/2010/08/dev-null-unlimited-scale/



On Thu, Jun 13, 2013 at 3:47 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Jun 13, 2013 at 10:47 AM, Markus Klems <ma...@gmail.com>
> wrote:
> > One scaling strategy seems interesting but we don't
> > fully understand what is going on, yet. The strategy works like this:
> > add new nodes to a Cassandra cluster with "auto_bootstrap = false" to
> > avoid streaming to the new nodes.
>
> If you set auto_bootstrap to false, new nodes take over responsibility
> for a range of the ring but do not receive the data for the range from
> the old nodes. If you read the new node at CL.ONE, you will get the
> answer that data you wrote to the old node does not exist, because the
> new node did not receive it as part of bootstrap. This is probably not
> what you expect.
>
> > We were a bit surprised that this
> > strategy improved performance considerably and that it worked much
> > better than other strategies that we tried before, both in terms of
> > scaling speed and performance impact during scaling.
>
> CL.ONE requests for rows which do not exist are very fast.
>
> > Would it be necessary (in a production environment) to stream the old
> SSTables from the other
> > four nodes at some point in time?
>
> Bootstrapping is necessary for consistency and durability, yes. If you
> were to:
>
> 1) start a new node without bootstrapping it
> 2) run "cleanup" compaction on the old node
>
> You would permanently delete the copy of the data that is no longer
> "supposed" to live on the old node. With an RF of 1, that data would be
> permanently gone. With an RF of >1 you have other copies, but if you
> never bootstrap while adding new nodes, over time you become
> increasingly likely to lose access to those copies.
>
> =Rob
>

Re: Scaling a cassandra cluster with auto_bootstrap set to false

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Jun 13, 2013 at 10:47 AM, Markus Klems <ma...@gmail.com> wrote:
> One scaling strategy seems interesting but we don't
> fully understand what is going on, yet. The strategy works like this:
> add new nodes to a Cassandra cluster with "auto_bootstrap = false" to
> avoid streaming to the new nodes.

If you set auto_bootstrap to false, new nodes take over responsibility
for a range of the ring but do not receive the data for the range from
the old nodes. If you read the new node at CL.ONE, you will get the
answer that data you wrote to the old node does not exist, because the
new node did not receive it as part of bootstrap. This is probably not
what you expect.
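
For illustration, a minimal Thrift-client sketch of such a CL.ONE read
(assuming the default YCSB schema of keyspace "usertable" / column
family "data", the default RPC port 9160, and a placeholder row key;
the class name is also a placeholder):

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ReadFromNewNode
{
  public static void main(String[] args) throws Exception
  {
    // Connect directly to the node that was added with auto_bootstrap=false.
    TTransport transport = new TFramedTransport(new TSocket(args[0], 9160));
    Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
    transport.open();

    // Default YCSB schema: keyspace "usertable", column family "data".
    client.set_keyspace("usertable");
    ColumnParent parent = new ColumnParent("data");

    // Ask for up to 100 columns of a key that was written before the new
    // node joined the ring ("user1000" is a placeholder key).
    SlicePredicate predicate = new SlicePredicate();
    predicate.setSlice_range(
        new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100));
    List<ColumnOrSuperColumn> columns = client.get_slice(
        ByteBuffer.wrap("user1000".getBytes("UTF-8")), parent, predicate,
        ConsistencyLevel.ONE);

    // CL.ONE is satisfied by a single replica. If the coordinator answers
    // from the unbootstrapped replica, the row comes back empty even though
    // the old nodes still hold it.
    System.out.println("Columns returned: " + columns.size());

    transport.close();
  }
}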

> We were a bit surprised that this
> strategy improved performance considerably and that it worked much
> better than other strategies that we tried before, both in terms of
> scaling speed and performance impact during scaling.

CL.ONE requests for rows which do not exist are very fast.

> Would it be necessary (in a production environment) to stream the old SSTables from the other
> four nodes at some point in time?

Bootstrapping is necessary for consistency and durability, yes. If you were to:

1) start a new node without bootstrapping it
2) run "cleanup" compaction on the old node

You would permanently delete the copy of the data that is no longer
"supposed" to live on the old node. With an RF of 1, that data would be
permanently gone. With an RF of >1 you have other copies, but if you
never bootstrap while adding new nodes, over time you become
increasingly likely to lose access to those copies.

=Rob