Posted to dev@cassandra.apache.org by Peter Schuller <pe...@infidyne.com> on 2010/07/07 19:09:25 UTC

Minimizing the impact of compaction on latency and throughput

Hello,

I have repeatedly seen users report that background compaction is
overly detrimental to a node's latency. While I have not yet deployed
Cassandra in a production situation where latencies are closely
monitored, these reports do not sound surprising to me given the
nature of compaction, and unless the developers here on the list say
otherwise I tend to believe it is a real issue.

Ignoring implementation difficulties for a moment, a few things that
seem sensible to me and could improve the situation are:

* Utilizing posix_fadvise() on both reads and writes to avoid
obliterating the operating system's caching of the sstables.
* Add the ability to rate limit disk I/O (in particular writes).
* Add the ability to perform direct I/O.
* Add the ability to fsync() regularly on writes so that the
operating system does not flush hundreds of megabytes of dirty data
in a single burst.
* (Not an improvement but general observation: it seems useless for
writes to the commit log to remain in cache after an fsync(), and so
they are a good candidate for posix_fadvise())

None of these would be silver bullets, and the importance and
appropriate settings for each would be very dependent on operating
system, hardware, etc. But having the ability to control some or all
of these should, I suspect, allow significantly lessening the impact
of compaction under a variety of circumstances.
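
To make the combination concrete, below is a minimal sketch (plain C,
not Cassandra code; the chunk size and the 10 MB/s budget are arbitrary
assumptions) of how a compaction-style write path could interleave
fsync() and posix_fadvise(POSIX_FADV_DONTNEED) with a crude rate limit:

----
/* Hypothetical sketch: write compaction output in chunks, fsync() each
 * chunk so the kernel cannot accumulate a huge burst of dirty pages,
 * drop the flushed range from the page cache, and sleep to stay under
 * a target rate. Error and short-write handling omitted for brevity. */
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define CHUNK (1 << 20)                 /* 1 MiB per write/fsync cycle */
#define TARGET_BYTES_PER_SEC (10 << 20) /* assumed 10 MB/s budget */

void write_compacted(int out_fd, const char *buf, size_t len)
{
    size_t written = 0;
    while (written < len) {
        size_t n = len - written < CHUNK ? len - written : CHUNK;
        write(out_fd, buf + written, n);
        fsync(out_fd);                      /* force dirty pages out now */
        posix_fadvise(out_fd, written, n,
                      POSIX_FADV_DONTNEED); /* evict what we just wrote */
        written += n;

        /* crude rate limit: sleep this chunk's whole time slot (ignores
         * the time the write itself took, so the effective rate stays at
         * or below the target) */
        struct timespec ts = { 0, (long)((double)n / TARGET_BYTES_PER_SEC * 1e9) };
        nanosleep(&ts, NULL);
    }
}
----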

With respect to cache eviction, this is one area where the impact can
probably be expected to be higher the more you rely on the operating
system's caching, and the less you rely on in-JVM caching done by
Cassandra.

The most obvious problem points to me include:

* posix_fadvise() and direct I/O cause portability and building
issues, necessitating native code.
* rate limiting is very indirect due to read-ahead, caching, etc. In
particular for writes, rate limiting would likely be almost useless
without also having fsync() or direct I/O, unless the rate is limited
to an extremely small amount and the cluster is taking very few
writes (such that the typical background flushing done by most OSes
happens often enough to not imply huge amounts of data).

Any thoughts? Has this already been considered and rejected? Do you
think compaction is in fact not a problem already? Are there other,
easier, better ways to accomplish the goal?

-- 
/ Peter Schuller

Re: Minimizing the impact of compaction on latency and throughput

Posted by Terje Marthinussen <tm...@gmail.com>.
On Tue, Jul 13, 2010 at 10:26 PM, Jonathan Ellis <jb...@gmail.com> wrote:

>
> I'm totally fine with saying "Here's a JNI library for Linux [or even
> Linux version >= 2.6.X]" since that makes up 99% of our production
> deployments, and leaving the remaining 1% with the status quo.
>

You really need to say Linux > 2.6 and filesystem xyz.

That probably reduces the percentage a bit, but probably not critically.

It is quite a while since I last wrote code for direct I/O (I really try
to avoid it these days), but from memory, as long as there is a somewhat
extendable framework that can be used as a basis for new platforms, it
should be reasonably trivial for an experienced person to add a new
Unix-like platform in a couple of days.

No idea about Windows; I have never written this kind of code there.


> > O_DIRECT also bypasses the cache completely
>
> Right, that's the idea. :)
>

Hm... I would have thought it was clear that my idea is that you do want to
interact with the cache if you can! :)

Under high load, you might reduce performance 10-30% by throwing away the
scheduling benefits you get from the OS (yes, that is based on real-life
experience). Of course, that assumes you can somehow avoid the worst-case
scenarios without direct I/O. As always, things will differ from use case
to use case.

A well-performing HW RAID card with sufficient write-back cache might also
help reduce the negative impact of direct I/O.

Funnily enough, it is often the systems with a light read load that are
hit hardest. Systems with a heavy read load have more pressure on the
cache on the read side, and writes will not push content out of the cache
(or applications out of physical memory) as easily. To make things more
annoying, OSes (not just Linux) have a tendency to behave differently from
release to release. What is a problem on one Linux release is not
necessarily a problem on another.

I have not seen huge I/O problems myself when compacting on Cassandra,
but I am currently working on hardware with loads of memory, so I might
not see the problems others see. I am more concerned with other
performance issues at the moment.

One nifty effect, which may or may not be worth looking into, is what
happens when you flip over to the new compacted SSTable: the last thing
you write to the new table will still be in cache, ready to be read once
you start using it. It can therefore be worth ordering the compaction so
that the most performance-critical parts are written last, and written
without direct I/O or similar settings, so they will be ready in cache
when needed.

I am not sure to what extent parts of the SSTables have structures of
importance like this for Cassandra. Haven't really thought about it until
now.

It might also be worth looking at I/O scheduler settings in the Linux
kernel. Some of the I/O schedulers also support ionice/I/O priorities.

I have never used it on single threads, but I have read that ioprio_set()
accepts thread ids (not just process ids, as the man page indicates). In
my experience it is not super efficient at preventing cache flushing of
mostly idle data, but if the compaction I/O occurs in isolated threads, so
that ionice can be applied to just those threads, it should help.
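
For illustration, a minimal sketch (plain C; the constants are copied from
the kernel's ioprio definitions, and the idle class only takes effect with
schedulers that honor I/O priorities, e.g. CFQ) of putting the calling
compaction thread into the idle I/O class:

----
/* Hypothetical sketch: mark the calling thread's I/O as idle-class via
 * ioprio_set(2), so the scheduler serves it only when no other I/O is
 * pending. glibc has no wrapper, so we go through syscall(2). */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#define IOPRIO_WHO_PROCESS  1   /* "who" is a process or thread id */
#define IOPRIO_CLASS_IDLE   3
#define IOPRIO_CLASS_SHIFT  13

static int set_compaction_ioprio(void)
{
    int ioprio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;

    /* who == 0 applies the priority to the calling thread only */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio) == -1) {
        perror("ioprio_set");
        return -1;
    }
    return 0;
}
----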



> Exactly: the fadvise mode that would actually be useful to us is a
> no-op and not likely to change soon. A bit of history:

Interesting, I had not seen that before.
Thanks!

Terje

Re: Minimizing the impact of compaction on latency and throughput

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Jul 13, 2010 at 6:18 AM, Terje Marthinussen
<tm...@gmail.com> wrote:
> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT.

I'm totally fine with saying "Here's a JNI library for Linux [or even
Linux version >= 2.6.X]" since that makes up 99% of our production
deployments, and leaving the remaining 1% with the status quo.

> O_DIRECT also bypasses the cache completely

Right, that's the idea. :)

> O_DIRECT was made to solve HW performance limitation on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvice is
> implemented properly, somewhat unavoidable)

Exactly: the fadvise mode that would actually be useful to us is a
no-op and not likely to change soon. A bit of history:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm1/broken-out/fadvise-make-posix_fadv_noreuse-a-no-op.patch

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Minimizing the impact of compaction on latency and throughput

Posted by Peter Schuller <pe...@infidyne.com>.
> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT. Just have a look at open(2) on linux:

[snip]

> So, just within Linux you got different mechanisms for this depending on
> kernel and fs in use and you need to figure out what to do yourself as the
> OS will not tell you that. Don't expect this alignment stuff to be more
> standardized across OSes than inside of Linux. Still find this portable?

The concept of direct I/O, yes. I don't have experience with how
portable the alignment requirements are in practice, however, so maybe
those details are a problem. But things like under what circumstances
which flags to posix_fadvise() actually have the desired effect don't
feel very portable either.

One might have a look at to what extent direct I/O works well across
platforms in e.g. PostgreSQL or something like that. But maybe you're
right and O_DIRECT is just not worth it.

> O_DIRECT also bypasses the cache completely, so you loose a lot of the I/O

That was the intent.

> scheduling and caching across multiple reads/writers in threaded apps and
> separated processes which the OS may offer.

This is specifically what I want to bypass. I want to bypass the
operating system's caching to (1) avoid trashing the cache and (2)
know that a rate-limited write translates fairly well to the underlying
storage. Rate limiting asynchronous writes will often be less than
ideal since the operating system will, by design, tend to defer writes.
That aspect can of course be overcome with fsync(), which does not even
require native code - a big point in its favor. But if we still need
native code for posix_fadvise() anyway (for reads), that hit is taken
regardless.

But sure. Perhaps posix_fadvise() in combination with regular fsync()
calls on writes is preferable to direct I/O (with fsync() being required
both for rate limiting purposes, if one wants to combat that, and for
avoiding cache eviction given the way fadvise currently works in Linux).

> This can especially be a big
> loss when you got servers with loads of memory for large filesystem caches
> where you might find it hard to actually utilize the cache in the
> application.

The entire point is to bypass the cache during compaction. But this
does not (unless I'm mistaken about how Cassandra works) invalidate
pre-existing caches at the Cassandra/JVM level. In addition, for large
data sets (large being significantly larger than RAM), the data pulled
into cache as part of compaction is not going to be useful anyway, as
is. There is the special case where all or most data fits in RAM and
having all compaction I/O go through the cache may even be preferable;
but in the general case, I really don't see the advantage of having
that I/O go through the cache.
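
As an illustration of the read side of this, a minimal sketch (plain C,
not Cassandra code; the 8 MiB window is an arbitrary assumption) of
streaming through an old sstable while telling the kernel the pages will
not be needed again:

----
/* Hypothetical sketch: read an sstable sequentially during compaction
 * and periodically drop the already-consumed range from the page cache,
 * so the bulk read does not evict hot data. Error handling omitted. */
#include <fcntl.h>
#include <unistd.h>

#define WINDOW (8 << 20)   /* drop cached pages every 8 MiB (assumed) */

void drain_sstable(int fd, void (*consume)(const char *, ssize_t))
{
    char buf[64 * 1024];
    off_t pos = 0, dropped = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        consume(buf, n);
        pos += n;
        if (pos - dropped >= WINDOW) {
            /* evict everything we have already consumed */
            posix_fadvise(fd, dropped, pos - dropped, POSIX_FADV_DONTNEED);
            dropped = pos;
        }
    }
}
----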

If you do have most or all data in RAM, then certainly having all that
data warm at all times is preferable to doing I/O on a cold buffer
cache against sstables. But on the other hand, any use of direct I/O
or fadvise() will (presumably) be optional. And given a setup where
your performance is entirely dependent on most data being in RAM at
all times, you will already have issues with e.g. cold starts of
nodes.

In any case, I definitely consider there to be good reasons not to
rely only on operating system caching; compaction is one of these
reasons, both with and without direct I/O or fadvise().

> O_DIRECT was made to solve HW performance limitation on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvice is
> implemented properly, somewhat unavoidable)

I think there are pretty clear and obvious use cases where the cache
eviction implied by large bulk streaming operations on large amounts
of data is not what you want (there are any number of practical
situations where this has been an issue for me, if nothing else). But
if I'm overlooking something that would make this optimization -
trying to avoid eviction - useless with Cassandra, please do explain
it to me :)

But I'll definitely buy that posix_fadvise() is probably a cleaner solution.

-- 
/ Peter Schuller

Re: Minimizing the impact of compaction on latency and throughput

Posted by Terje Marthinussen <tm...@gmail.com>.
> (2) posix_fadvise() feels more obscure and less portable than
> O_DIRECT, the latter being well-understood and used by e.g. databases
> for a long time.
>

Due to the need for doing data alignment in the application itself (you are
bypassing all the OS magic here), there is really nothing portable about
O_DIRECT. Just have a look at open(2) on Linux:
----
  O_DIRECT
       The  O_DIRECT  flag may impose alignment restrictions on the length
       and address of userspace buffers and the file offset of I/Os.  In
       Linux alignment restrictions vary by file system and kernel version
       and might be absent entirely.  However there is currently no file
       system-independent interface for an application to discover these
       restrictions for a given file or file system.  Some file systems
       provide their own interfaces for doing so, for example the
       XFS_IOC_DIOINFO operation in xfsctl(3).
----
So, just within Linux you have different mechanisms for this depending on
the kernel and filesystem in use, and you need to figure out what to do
yourself because the OS will not tell you. Don't expect this alignment
stuff to be more standardized across OSes than it is inside Linux. Still
find this portable?
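
To make the alignment requirement concrete, a minimal sketch (plain C on
Linux; the 4096-byte alignment is an assumption, since, as described
above, the real value depends on the filesystem and kernel and is not
portably discoverable):

----
/* Hypothetical sketch: with O_DIRECT the buffer address, the transfer
 * size and the file offset all have to satisfy the alignment rule, so
 * the application has to allocate aligned memory itself. */
#define _GNU_SOURCE          /* O_DIRECT is a Linux extension */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define ALIGN 4096           /* assumed; not discoverable portably */

int read_direct(const char *path, size_t len)   /* len: multiple of ALIGN */
{
    void *buf;
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd == -1)
        return -1;

    if (posix_memalign(&buf, ALIGN, len) != 0) {   /* aligned buffer */
        close(fd);
        return -1;
    }

    ssize_t n = read(fd, buf, len);   /* bypasses the page cache entirely */

    free(buf);
    close(fd);
    return n < 0 ? -1 : 0;
}
----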

O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O
scheduling and caching across multiple readers/writers in threaded apps
and separate processes which the OS may offer. This can be an especially
big loss when you have servers with loads of memory for large filesystem
caches and you might find it hard to actually utilize that cache in the
application.

O_DIRECT was made to work around HW performance limitations on servers 10+
years ago. It is far from an ideal solution today (but until stuff like
fadvise is implemented properly, it is somewhat unavoidable).

Best regards,
Terje

Re: Minimizing the impact of compaction on latency and throughput

Posted by Peter Schuller <pe...@infidyne.com>.
> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)

Thanks. That's helpful; I've been trying to avoid JNI in the past, so I
wasn't familiar with the API, and the main difficulty was likely to be
how best to expose the functionality to Java. Having someone do almost
exactly the same thing helps ;)

I'm also glad they confirmed the effect in a very similar situation. I'm
leaning towards O_DIRECT as well because:

(1) Even if posix_fadvise() is used, on writes you'll need to fsync()
before fadvise() anyway in order to allow Linux to evict the pages (a
theoretical OS implementation might remember the advise call, but
Linux doesn't - at least not up until recently).

(2) posix_fadvise() feels more obscure and less portable than
O_DIRECT, the latter being well-understood and used by e.g. databases
for a long time.

(3) O_DIRECT allows more direct control over when and how much I/O
happens (without playing tricks or making assumptions about e.g.
read-ahead), so it will probably make it easier to kill both birds with
one stone.

You indicated you were skeptical about writing an I/O scheduler. While
I agree that writing a real I/O scheduler is difficult, I suspect that
if we do direct I/O a fairly simple scheme should work well. Being able
to tweak a target MB/sec rate, select a chunk size, and select the time
window over which to rate limit would, I suspect, go a long way.
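
Something along those lines, as a minimal sketch (plain C, not Cassandra
code; all names and the example numbers are made up): a small
token-bucket-style limiter with exactly those knobs - a target rate, the
caller's chunk size, and a window that bounds bursts:

----
/* Hypothetical sketch of a simple rate limiter: call rate_limit() before
 * writing each chunk; it sleeps as needed so the average rate stays at
 * or below target_bps, while allowing bursts up to one window's worth. */
#include <stddef.h>
#include <time.h>

struct rate_limiter {
    double target_bps;      /* target bytes per second, e.g. 10 MB/s */
    double window_sec;      /* burst window, e.g. 0.5 seconds */
    double allowance;       /* bytes that may still be written right now */
    struct timespec last;   /* when the allowance was last refilled */
};

static double seconds_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

void rate_limit(struct rate_limiter *rl, size_t bytes)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    /* refill the allowance for the elapsed time, capped at one window */
    rl->allowance += seconds_between(rl->last, now) * rl->target_bps;
    if (rl->allowance > rl->target_bps * rl->window_sec)
        rl->allowance = rl->target_bps * rl->window_sec;
    rl->last = now;

    rl->allowance -= (double)bytes;
    if (rl->allowance < 0) {
        /* ahead of budget: sleep until the deficit is repaid */
        double wait = -rl->allowance / rl->target_bps;
        struct timespec ts = { (time_t)wait,
                               (long)((wait - (double)(time_t)wait) * 1e9) };
        nanosleep(&ts, NULL);
    }
}
----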

The situation is a bit special since in this case we are talking about
one type of I/O that is run during controlled circumstances
(controlled concurrency, we know how much memory we eat in total,
etc).

I suspect there may be a problem sustaining rates during high read
loads though. We'll see.

I'll try to make time for trying this out.

-- 
/ Peter Schuller

Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Jorge Barrios <jo...@tapulous.com>.
Each of my top-level functions was allocating a Hector client connection at
the top, and releasing it when returning. The problem arose when a top-level
function had to call another top-level function, which led to the same
thread allocating two connections. Hector was not releasing one of them even
though I was explicitly requesting them to be released. This might have been
fixed since then, and like I said, I didn't dig into why it was happening. I
just made sure to pass along the connection instances as necessary and the
problem went away.

On Wed, Jul 14, 2010 at 11:40 AM, shimi <sh...@gmail.com> wrote:

> do you mean that you don't release the connection back to fhe pool?
>
> On 2010 7 14 20:51, "Jorge Barrios" <jo...@tapulous.com> wrote:
>
> Thomas, I had a similar problem a few weeks back. I changed my code to make
> sure that each thread only creates and uses one Hector connection. It seems
> that client sockets are not being released properly, but I didn't have the
> time to dig into it.
>
> Jorge
>
>
>
> On Wed, Jul 14, 2010 at 8:28 AM, Peter Schuller <
> peter.schuller@infidyne.com> wrote:
> >
> > > [snip]
> ...
>
>

Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by shimi <sh...@gmail.com>.
Do you mean that you don't release the connection back to the pool?

On 2010 7 14 20:51, "Jorge Barrios" <jo...@tapulous.com> wrote:

Thomas, I had a similar problem a few weeks back. I changed my code to make
sure that each thread only creates and uses one Hector connection. It seems
that client sockets are not being released properly, but I didn't have the
time to dig into it.

Jorge



On Wed, Jul 14, 2010 at 8:28 AM, Peter Schuller <pe...@infidyne.com>
wrote:
>
> > [snip]
...

Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Jorge Barrios <jo...@tapulous.com>.
Thomas, I had a similar problem a few weeks back. I changed my code to make
sure that each thread only creates and uses one Hector connection. It seems
that client sockets are not being released properly, but I didn't have the
time to dig into it.

Jorge

On Wed, Jul 14, 2010 at 8:28 AM, Peter Schuller <peter.schuller@infidyne.com
> wrote:

> > [snip]
> > I'm not sure that is the case.
> >
> > When the server gets into the unrecoverable state, the repeating
> exceptions
> > are indeed "SocketException: Too many open files".
> [snip]
> > Although this is unquestionably a network error,  I don't think it is
> > actually a
> > network problem per se, as the maximum number of sockets open by the
> > Cassandra server is at this point is about 8.  When I kill the client,
> > sockets
> > held are just the listening sockets - no sockets in ESTABLISHED or
> > TIMED_WAIT.
>
> Is this based on netstat or lsof or similar? When the node is in the
> state of giving these errors, try inspecting /proc/<pid>/fd or use
> lsof. Presumably you'll see thousands of fds of some category; either
> sockets or files.
>
> (If you already did this, sorry!)
>
> --
> / Peter Schuller
>

Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Peter Schuller <pe...@infidyne.com>.
> [snip]
> I'm not sure that is the case.
>
> When the server gets into the unrecoverable state, the repeating exceptions
> are indeed "SocketException: Too many open files".
[snip]
> Although this is unquestionably a network error,  I don't think it is
> actually a
> network problem per se, as the maximum number of sockets open by the
> Cassandra server is at this point is about 8.  When I kill the client,
> sockets
> held are just the listening sockets - no sockets in ESTABLISHED or
> TIMED_WAIT.

Is this based on netstat or lsof or similar? When the node is in the
state of giving these errors, try inspecting /proc/<pid>/fd or use
lsof. Presumably you'll see thousands of fds of some category; either
sockets or files.

(If you already did this, sorry!)
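
In practice "ls -l /proc/<pid>/fd" or lsof does this directly, but as a
self-contained illustration (plain C; the function name and layout are
just for this sketch), a helper that walks the fd directory and prints
what each descriptor points at:

----
/* Hypothetical sketch: walk /proc/<pid>/fd and print each descriptor's
 * target, so you can see whether a "Too many open files" node is holding
 * sockets or (possibly deleted) sstable files. */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <sys/types.h>

void dump_fds(pid_t pid)
{
    char dir[64], fdpath[128], target[4096];
    struct dirent *e;
    int count = 0;

    snprintf(dir, sizeof dir, "/proc/%d/fd", (int)pid);
    DIR *d = opendir(dir);
    if (!d) {
        perror(dir);
        return;
    }

    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;
        snprintf(fdpath, sizeof fdpath, "%s/%s", dir, e->d_name);
        ssize_t n = readlink(fdpath, target, sizeof target - 1);
        if (n >= 0) {
            target[n] = '\0';           /* readlink does not terminate */
            printf("fd %s -> %s\n", e->d_name, target);
        }
        count++;
    }
    closedir(d);
    printf("%d descriptors open\n", count);
}
----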

-- 
/ Peter Schuller

Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Thomas Downing <td...@proteus-technologies.com>.
On 7/14/2010 11:07 AM, Jonathan Ellis wrote:
> socketexception means this is coming from the network, not the sstables
>
> knowing the full error message would be nice, but just about any
> problem on that end should be fixed by adding connection pooling to
> your client.
>
> (moving to user@)
>
> On Wed, Jul 14, 2010 at 5:09 AM, Thomas Downing
> <td...@proteus-technologies.com>  wrote:
>    
>> On 7/13/2010 9:20 AM, Jonathan Ellis wrote:
>>      
>>> On Tue, Jul 13, 2010 at 4:19 AM, Thomas Downing
>>> <td...@proteus-technologies.com>    wrote:
>>>
>>>        
>>>> On a related note:  I am running some feasibility tests looking for
>>>> high ingest rate capabilities.  While testing Cassandra the problem
>>>> I've encountered is that it runs out of file handles during compaction.
>>>>
>>>>          
>>>
[snip]
I'm not sure that is the case.

When the server gets into the unrecoverable state, the repeating exceptions
are indeed "SocketException: Too many open files".

WARN [main] 2010-07-14 06:08:46,772 TThreadPoolServer.java (line 190) Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
     at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
     at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
     at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
     at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
     at org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
     at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
Caused by: java.net.SocketException: Too many open files
     at java.net.PlainSocketImpl.socketAccept(Native Method)
     at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
     at java.net.ServerSocket.implAccept(ServerSocket.java:453)
     at java.net.ServerSocket.accept(ServerSocket.java:421)
     at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
     ... 5 more

Although this is unquestionably a network error, I don't think it is
actually a network problem per se, as the maximum number of sockets open
by the Cassandra server at this point is about 8.  When I kill the client,
the only sockets held are the listening sockets - no sockets in
ESTABLISHED or TIME_WAIT.

I was originally using the client interface provided by Hector, but went
to the direct Thrift API to eliminate moving parts in the puzzle.  When
using Hector, I was using the ClientConnectionPool.  Either way, the
behavior is the same.

Just a further note: my client test jig acquires a single connection, then
uses that connection for successive batch_mutate operations without
closing it.  It only closes the connection on an exception, or at the end
of the run.  If it would be helpful, I can change that to
open/mutate/close and repeat to see what happens.

Thanks
Thomas Downing


Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Jonathan Ellis <jb...@gmail.com>.
SocketException means this is coming from the network, not the sstables.

Knowing the full error message would be nice, but just about any
problem on that end should be fixed by adding connection pooling to
your client.

(moving to user@)

On Wed, Jul 14, 2010 at 5:09 AM, Thomas Downing
<td...@proteus-technologies.com> wrote:
> On 7/13/2010 9:20 AM, Jonathan Ellis wrote:
>>
>> On Tue, Jul 13, 2010 at 4:19 AM, Thomas Downing
>> <td...@proteus-technologies.com>  wrote:
>>
>>>
>>> On a related note:  I am running some feasibility tests looking for
>>> high ingest rate capabilities.  While testing Cassandra the problem
>>> I've encountered is that it runs out of file handles during compaction.
>>>
>>
>> This usually just means "increase the allowed fh via ulimit/etc."
>>
>> Increasing the memtable thresholds so that you create less sstables,
>> but larger ones, is also a good idea.  The defaults are small so
>> Cassandra can work on a 1GB heap which is much smaller than most
>> production ones.  Reasonable rule of thumb: if you have a heap of N
>> GB, increase both the throughput and count thresholds by N times.
>>
>>
>
> Thanks for the suggestion.  I gave it a whirl, but no go.  The file handles
> in
> in use stayed at around 500 for the first 30M or so mutates, then within
> 4 seconds they jumped to about 800, stayed there for about 30 seconds,
> then within 5 seconds went over 2022, at which point the server entered
> the cycle of "SocketException: Too many open files.  Interesting thing is
> that the file limit for this process is 32768.  Note the numbers below as
> well.
>
> If there is anything specific you would like me to try, let me know.
>
> Seems like there's some sort of non-linear behavior here.  This behavior is
> the same as before I multiplied the Cassandra params by 4 (number of G);
> which leads me to think that increasing limits, whether files or Cassandra
> parameters is likely to be a tail-chasing excercise.
>
> This causes time-out exceptions at the client.  On this exception, my client
> closes the connection, waits a bit, then retries.  After a few hours of this
> the server still had not recovered.
>
> I killed the clients, and watched the server after that.  The file handles
> open
> dropped by 8, and have stayed there.  The server is, of course, not throwing
> SocketException any more.  On the other hand, the server is not doing any
> thing at all.
>
> When there is no client activity, and the server is idle, there are 155
> threads
> running in the JVM.  The all are in one of three states, almost all blocked
> at
> futex( ),  a few blocked at accept( ) , a few cycling over timeout on
> futex(),
> gettimeofday(), futex() ... None are blocked at IO.  I can't attach a
> debugger,
> I get IO exceptions trying either socket or local connections, no surprise,
> so I don't know of a way to get the Java code where the threads are
> blocking.
>
> More than one fd can be open on a given file, and many of open fd's are
> on files that have been deleted.  The stale fd's are all on Data.db files in
> the
> data directory, which I have separate from the commit log directory.
>
> I haven't had a chance to look at the code handling files, and I am not any
> sort of Java expert, but could this be due to Java's lazy resource clean up?
> I wonder if when considering writing your own file handling classes for
> O_DIRECT or posix_fadvise or whatever, an explicit close(2) might help.
>
> A restart of the client causes immediate SocketExceptions at the server and
> timeouts at the client.
>
> I noted on the restart that the open fd's jumped by 32, despite only making
> 4 connections.  At this point, there were 2028 open files - more than there
> where when the exceptions began at 2002 open files.  So it seems like the
> exception is not caused by the OS returning EMFILE - unless it was returning
> EMFILE for some strange reason, and the bump in open files is due to an
> increase in duplicate open files.  (BTW, it's not ENFILE!).
>
> I also noted that although the TimeoutExceptions did not  occur immediately
> on the client, the SocketExceptions began immediately on the server.  This
> does not seem to match up.  I am using the org.apache.cassandra.thrift API
> directly, not any higher level wrapper.
>
> Finally, this jump to 2028 on the restart caused a new symptom.  I only had
> the client running a few seconds, but after 15 minutes, the server is still
> throwing
> exceptions, even though the open file handles immediately dropped from
> 2028 down to 1967.
>
> Thanks for your attention, and all your work,
>
> Thomas Downing
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Peter Schuller <pe...@infidyne.com>.
> As a Cassandra newbie, I'm not sure how to tell, but they are all
> to *.Data.db files, and all under the DataFileDirectory (as spec'ed
> in storage-conf.xml), which is a separate directory than the
> CommitLogDirectory.  I did not see any *Index.db or *Filter.db
> files, but I may have missed them.

The *.Data.db files are indeed sstables.

-- 
/ Peter Schuller

Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Thomas Downing <td...@proteus-technologies.com>.
On 7/14/2010 7:16 AM, Peter Schuller wrote:
>> More than one fd can be open on a given file, and many of open fd's are
>> on files that have been deleted.  The stale fd's are all on Data.db files in
>> the
>> data directory, which I have separate from the commit log directory.
>>
>> I haven't had a chance to look at the code handling files, and I am not any
>> sort of Java expert, but could this be due to Java's lazy resource clean up?
>> I wonder if when considering writing your own file handling classes for
>> O_DIRECT or posix_fadvise or whatever, an explicit close(2) might help.
>>      
> The fact that there are open fds to deleted files is interesting... I
> wonder if people have reported weird disk space usage in the past
> (since such deleted files would not show up with 'du -sh' but eat
> space on the device until closed).
>
> My general understanding is that Cassandra does specifically rely on
> the GC to know when unused sstables can be removed. However the fact
> that the files are deleted I think means that this is not the problem,
> and the question is rather why open file descriptors/streams are
> leaking to these deleted sstables. But I'm speaking now without
> knowing when/where streams are closed.
>
> Are the deleted files indeed sstable, or was that a bad assumption on my part?
>
>    
As a Cassandra newbie, I'm not sure how to tell, but they all point
to *.Data.db files, and all under the DataFileDirectory (as spec'ed
in storage-conf.xml), which is a separate directory from the
CommitLogDirectory.  I did not see any *Index.db or *Filter.db
files, but I may have missed them.


Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Peter Schuller <pe...@infidyne.com>.
> More than one fd can be open on a given file, and many of open fd's are
> on files that have been deleted.  The stale fd's are all on Data.db files in
> the
> data directory, which I have separate from the commit log directory.
>
> I haven't had a chance to look at the code handling files, and I am not any
> sort of Java expert, but could this be due to Java's lazy resource clean up?
> I wonder if when considering writing your own file handling classes for
> O_DIRECT or posix_fadvise or whatever, an explicit close(2) might help.

The fact that there are open fds to deleted files is interesting... I
wonder if people have reported weird disk space usage in the past
(since such deleted files would not show up with 'du -sh' but eat
space on the device until closed).

My general understanding is that Cassandra does specifically rely on
the GC to know when unused sstables can be removed. However the fact
that the files are deleted I think means that this is not the problem,
and the question is rather why open file descriptors/streams are
leaking to these deleted sstables. But I'm speaking now without
knowing when/where streams are closed.

Are the deleted files indeed sstables, or was that a bad assumption on my part?

-- 
/ Peter Schuller

Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

Posted by Thomas Downing <td...@proteus-technologies.com>.
On 7/13/2010 9:20 AM, Jonathan Ellis wrote:
> On Tue, Jul 13, 2010 at 4:19 AM, Thomas Downing
> <td...@proteus-technologies.com>  wrote:
>    
>> On a related note:  I am running some feasibility tests looking for
>> high ingest rate capabilities.  While testing Cassandra the problem
>> I've encountered is that it runs out of file handles during compaction.
>>      
> This usually just means "increase the allowed fh via ulimit/etc."
>
> Increasing the memtable thresholds so that you create less sstables,
> but larger ones, is also a good idea.  The defaults are small so
> Cassandra can work on a 1GB heap which is much smaller than most
> production ones.  Reasonable rule of thumb: if you have a heap of N
> GB, increase both the throughput and count thresholds by N times.
>
>    
Thanks for the suggestion.  I gave it a whirl, but no go.  The file
handles in use stayed at around 500 for the first 30M or so mutates, then
within 4 seconds they jumped to about 800, stayed there for about 30
seconds, then within 5 seconds went over 2022, at which point the server
entered the cycle of "SocketException: Too many open files".  The
interesting thing is that the file limit for this process is 32768.  Note
the numbers below as well.

If there is anything specific you would like me to try, let me know.

Seems like there's some sort of non-linear behavior here.  This behavior
is the same as before I multiplied the Cassandra params by 4 (the number
of GB of heap), which leads me to think that increasing limits, whether
file handles or Cassandra parameters, is likely to be a tail-chasing
exercise.

This causes time-out exceptions at the client.  On this exception, my client
closes the connection, waits a bit, then retries.  After a few hours of this
the server still had not recovered.

I killed the clients, and watched the server after that.  The open file
handles dropped by 8, and have stayed there.  The server is, of course,
not throwing SocketException any more.  On the other hand, the server is
not doing anything at all.

When there is no client activity and the server is idle, there are 155
threads running in the JVM.  They are all in one of three states: almost
all blocked at futex(), a few blocked at accept(), and a few cycling over
timeout on futex(), gettimeofday(), futex() ... None are blocked on I/O.
I can't attach a debugger - I get I/O exceptions trying either socket or
local connections, no surprise - so I don't know of a way to see where in
the Java code the threads are blocking.

More than one fd can be open on a given file, and many of the open fds
are on files that have been deleted.  The stale fds are all on Data.db
files in the data directory, which I keep separate from the commit log
directory.

I haven't had a chance to look at the file handling code, and I am not any
sort of Java expert, but could this be due to Java's lazy resource cleanup?
I wonder whether, if you end up writing your own file handling classes for
O_DIRECT or posix_fadvise or whatever, an explicit close(2) might help.

A restart of the client causes immediate SocketExceptions at the server and
timeouts at the client.

I noted on the restart that the open fds jumped by 32, despite only making
4 connections.  At this point, there were 2028 open files - more than there
were when the exceptions began at 2002 open files.  So it seems like the
exception is not caused by the OS returning EMFILE - unless it was returning
EMFILE for some strange reason, and the bump in open files is due to an
increase in duplicate open files.  (BTW, it's not ENFILE!)

I also noted that although the TimeoutExceptions did not  occur immediately
on the client, the SocketExceptions began immediately on the server.  This
does not seem to match up.  I am using the org.apache.cassandra.thrift API
directly, not any higher level wrapper.

Finally, this jump to 2028 on the restart caused a new symptom.  I only had
the client running a few seconds, but after 15 minutes, the server is 
still throwing
exceptions, even though the open file handles immediately dropped from
2028 down to 1967.

Thanks for your attention, and all your work,

Thomas Downing

Re: Minimizing the impact of compaction on latency and throughput

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Jul 13, 2010 at 4:19 AM, Thomas Downing
<td...@proteus-technologies.com> wrote:
> On a related note:  I am running some feasibility tests looking for
> high ingest rate capabilities.  While testing Cassandra the problem
> I've encountered is that it runs out of file handles during compaction.

This usually just means "increase the allowed fh via ulimit/etc."

Increasing the memtable thresholds so that you create fewer sstables,
but larger ones, is also a good idea.  The defaults are small so that
Cassandra can work on a 1GB heap, which is much smaller than most
production ones.  Reasonable rule of thumb: if you have a heap of N GB,
increase both the throughput and count thresholds by N times.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Minimizing the impact of compaction on latency and throughput

Posted by Thomas Downing <td...@proteus-technologies.com>.
On a related note:  I am running some feasibility tests looking for
high ingest rate capabilities.  While testing Cassandra the problem
I've encountered is that it runs out of file handles during compaction.
Until that event, there was no significant impact on throughput as
I was using it (0.1 query per second, ~10,000 records/query).

Up to this point, Cassandra was definitely in the lead among the
alternatives.

This was with 0.6.3, single node installation.  Ingest rate was about
4000 records/second, 1600 bytes/record, 24 bytes/key, using
batch_mutate.  Unfortunately, Cassandra seems unable to recover from
this state.  This occurs at about 100M records in the database.

I tried a 0.7.0 snapshot, but encountered earlier and worse problems.

The machine is 4 CPU AMD64 2.2GHz, 4GB.  There was no swapping.

The only mention of running out of file handles I found in the archives
or the defect list was related to queries - but I am notoriously blind.
I see the same behavior running ingest only, with no queries.

I've blown away the logs and data, but if there is interest in further info
on this problem, such as stacktrace and specific numbers, I will re-run
the test and send them along.

Thanks much for all your work

Thomas Downing

On 7/12/2010 10:52 PM, Jonathan Ellis wrote:
> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)
>
> On Fri, Jul 9, 2010 at 1:52 AM, Peter Schuller
> <pe...@infidyne.com>  wrote:
>    
>>> It might be worth experimenting with posix_fadvise.  I don't think
>>> implementing our own i/o scheduler or rate-limiter would be as good a
>>> use of time (it sounds like you're on that page too).
>>>        
>> Ok. And yes I mostly agree, although I can imagine circumstances where
>> a pretty simple rate limiter might help significantly - albeit be
>> something that has to be tweaked very specifically for the
>> situation/hardware rather than being auto-tuned.
>>
>> If I have the time I may look into posix_fadvise() to begin with (but
>> I'm not promising anything).
>>
>> Thanks for the input!
>>
>> --
>> / Peter Schuller
>>
>>      
>
>    

Re: Minimizing the impact of compaction on latency and throughput

Posted by Jonathan Ellis <jb...@gmail.com>.
This looks relevant:
http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
comments for directions to code sample)

On Fri, Jul 9, 2010 at 1:52 AM, Peter Schuller
<pe...@infidyne.com> wrote:
>> It might be worth experimenting with posix_fadvise.  I don't think
>> implementing our own i/o scheduler or rate-limiter would be as good a
>> use of time (it sounds like you're on that page too).
>
> Ok. And yes I mostly agree, although I can imagine circumstances where
> a pretty simple rate limiter might help significantly - albeit be
> something that has to be tweaked very specifically for the
> situation/hardware rather than being auto-tuned.
>
> If I have the time I may look into posix_fadvise() to begin with (but
> I'm not promising anything).
>
> Thanks for the input!
>
> --
> / Peter Schuller
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Minimizing the impact of compaction on latency and throughput

Posted by Peter Schuller <pe...@infidyne.com>.
> It might be worth experimenting with posix_fadvise.  I don't think
> implementing our own i/o scheduler or rate-limiter would be as good a
> use of time (it sounds like you're on that page too).

Ok. And yes, I mostly agree, although I can imagine circumstances where
a pretty simple rate limiter might help significantly - albeit as
something that has to be tweaked very specifically for the
situation/hardware rather than being auto-tuned.

If I have the time I may look into posix_fadvise() to begin with (but
I'm not promising anything).

Thanks for the input!

-- 
/ Peter Schuller

Re: Minimizing the impact of compaction on latency and throughput

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Jul 7, 2010 at 4:57 PM, Peter Schuller
<pe...@infidyne.com> wrote:
>> This makes sense, but from what I have seen, read contention vs
>> cassandra is a much bigger deal than write contention

(meant "read contention vs compaction")

> I am not really concerned with write performance, but rather with
> writes affecting reads. Cache eviction due to streaming writes has the
> obvious effect on hit ratio on reads, and in general large sustained
> writes tend to very negatively affect read latency (and typically in a
> bursty fashion). So; the idea was to specifically optimize to try to
> make reads be minimally affected by writes (in the sense of the
> background compaction eventually resulting from those writes).
>
> Understood and agreed about the commit log (though as long as write
> bursts are within what a battery-backed RAID controller can keep in
> its cache I'd expect it to potentially work pretty well without
> separation, if you do have such a setup).

It might be worth experimenting with posix_fadvise.  I don't think
implementing our own i/o scheduler or rate-limiter would be as good a
use of time (it sounds like you're on that page too).

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Minimizing the impact of compaction on latency and throughput

Posted by Peter Schuller <pe...@infidyne.com>.
> This makes sense, but from what I have seen, read contention vs
> cassandra is a much bigger deal than write contention (unless you
> don't have a separate device for your commitlog, but optimizing for
> that case isn't one of our goals).

I am not really concerned with write performance, but rather with
writes affecting reads. Cache eviction due to streaming writes has the
obvious effect on read hit ratio, and in general large sustained writes
tend to affect read latency very negatively (and typically in a bursty
fashion). So the idea was to specifically optimize to make reads
minimally affected by writes (in the sense of the background compaction
eventually resulting from those writes).

Understood and agreed about the commit log (though as long as write
bursts are within what a battery-backed RAID controller can keep in
its cache I'd expect it to potentially work pretty well without
separation, if you do have such a setup).

-- 
/ Peter Schuller

Re: Minimizing the impact of compaction on latency and throughput

Posted by Jonathan Ellis <jb...@gmail.com>.
This makes sense, but from what I have seen, read contention vs
cassandra is a much bigger deal than write contention (unless you
don't have a separate device for your commitlog, but optimizing for
that case isn't one of our goals).

On Wed, Jul 7, 2010 at 12:09 PM, Peter Schuller
<pe...@infidyne.com> wrote:
> Hello,
>
> I have repeatedly seen users report that background compaction is
> overly detrimental to the behavior of the node with respect to
> latency. While I have not yet deployed cassandra in a production
> situation where latencies are closely monitored, these reports do not
> really sound very surprising to me given the nature of compaction and
> unless otherwise stated by developers here on the list I tend to
> believe that it is a real issue.
>
> Ignoring implementation difficulties for a moment, a few things that
> could improve the situation, that seem sensible to me, are:
>
> * Utilizing posix_fadvise() on both reads and writes to avoid
> obliterating the operating system's caching of the sstables.
> * Add the ability to rate limit disk I/O (in particular writes).
> * Add the ability to perform direct I/O.
> * Add the ability to fsync() regularly on writes to force the
> operating system to not decide to flush hundreds of megabytes of data
> out in a single burst.
> * (Not an improvement but general observation: it seems useless for
> writes to the commit log to remain in cache after an fsync(), and so
> they are a good candidate for posix_fadvise())
>
> None of these would be silver bullets, and the importance and
> appropriate settings for each would be very dependent on operating
> system, hardware, etc. But having the ability to control some or all
> of these should, I suspect, allow significantly lessening the impact
> of compaction under a variety of circumstances.
>
> With respect to cache eviction, the this is one area where the impact
> can probably be expected to be higher the more you rely on the
> operating systems caching, and the less you rely on in-JVM caching
> done by cassandra.
>
> The most obvious problem points to me include:
>
> * posix_fadvise() and direct I/O cause portability and building
> issues, necessitating native code.
> * rate limiting is very indirect due to read-ahead, caching, etc. in
> particular for writes, rate limiting them would likely be almost
> useless without also having fsync() or direct I/O, unless it is rate
> limited to an extremely small amount and the cluster is taking very
> few writes (such that the typical background flushing done by most
> OS:es is done often enough to not imply huge amounts of data)
>
> Any thoughts? Has this already been considered and rejected? Do you
> think compaction is in fact not a problem already? Are there other,
> easier, better ways to accomplish the goal?
>
> --
> / Peter Schuller
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Minimizing the impact of compaction on latency and throughput

Posted by Rishi Bhardwaj <kh...@yahoo.com>.
I have done some bulk write performance tests and saw background compaction
having a big detrimental impact on write performance. I was also wondering
if there is a tunable to limit the frequency of compaction on the sstables.
If not, then adding such a configuration option would also help in
controlling the performance impact of the compaction operation.

-Rishi



________________________________
From: Peter Schuller <pe...@infidyne.com>
To: dev@cassandra.apache.org
Sent: Wed, July 7, 2010 10:09:25 AM
Subject: Minimizing the impact of compaction on latency and throughput

Hello,

I have repeatedly seen users report that background compaction is
overly detrimental to the behavior of the node with respect to
latency. While I have not yet deployed cassandra in a production
situation where latencies are closely monitored, these reports do not
really sound very surprising to me given the nature of compaction and
unless otherwise stated by developers here on the list I tend to
believe that it is a real issue.

Ignoring implementation difficulties for a moment, a few things that
could improve the situation, that seem sensible to me, are:

* Utilizing posix_fadvise() on both reads and writes to avoid
obliterating the operating system's caching of the sstables.
* Add the ability to rate limit disk I/O (in particular writes).
* Add the ability to perform direct I/O.
* Add the ability to fsync() regularly on writes to force the
operating system to not decide to flush hundreds of megabytes of data
out in a single burst.
* (Not an improvement but general observation: it seems useless for
writes to the commit log to remain in cache after an fsync(), and so
they are a good candidate for posix_fadvise())

None of these would be silver bullets, and the importance and
appropriate settings for each would be very dependent on operating
system, hardware, etc. But having the ability to control some or all
of these should, I suspect, allow significantly lessening the impact
of compaction under a variety of circumstances.

With respect to cache eviction, the this is one area where the impact
can probably be expected to be higher the more you rely on the
operating systems caching, and the less you rely on in-JVM caching
done by cassandra.

The most obvious problem points to me include:

* posix_fadvise() and direct I/O cause portability and building
issues, necessitating native code.
* rate limiting is very indirect due to read-ahead, caching, etc. in
particular for writes, rate limiting them would likely be almost
useless without also having fsync() or direct I/O, unless it is rate
limited to an extremely small amount and the cluster is taking very
few writes (such that the typical background flushing done by most
OS:es is done often enough to not imply huge amounts of data)

Any thoughts? Has this already been considered and rejected? Do you
think compaction is in fact not a problem already? Are there other,
easier, better ways to accomplish the goal?

-- 
/ Peter Schuller