Posted to user@cassandra.apache.org by Roland Gude <ro...@yoochoose.com> on 2010/12/16 15:21:51 UTC

Streaming Row Ranges

Hi

In order to access all rows in Cassandra, a common pattern is to do multiple range scans and page through them, starting each scan with the last key from the previous result. This introduces a lot of (unnecessary) latency, as the client has to read the result, extract the last key and start a new query, which Cassandra then has to process.
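For illustration, the paging pattern described above might look roughly like the sketch below. It is not taken from any particular client library: `fetch_page` is a hypothetical wrapper around whatever range-scan call the client offers, and the sketch assumes that the start key is inclusive and that an empty start key means "from the beginning". `make_fake_fetch` is only an in-memory stand-in so the example runs without a cluster.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[str, dict]  # (row key, columns)


def scan_all_rows(fetch_page: Callable[[str, int], List[Row]],
                  page_size: int = 100) -> Iterator[Row]:
    """Yield every row by issuing one range scan after another."""
    if page_size < 2:
        # With an inclusive start key, a page of one row would never advance.
        raise ValueError("page_size must be at least 2")
    start_key = ""  # assumption: empty key means "start of the range"
    first_page = True
    while True:
        # One full round trip per page; nothing else happens while we wait.
        page = fetch_page(start_key, page_size)
        # Assumption: the start key is inclusive, so every page after the
        # first repeats the last row of the previous page. Skip that row.
        rows = page if first_page else page[1:]
        for row in rows:
            yield row
        if len(page) < page_size:
            return  # a short page means the range is exhausted
        start_key = page[-1][0]
        first_page = False


def make_fake_fetch(keys: List[str]) -> Callable[[str, int], List[Row]]:
    """Hypothetical in-memory stand-in for a cluster, for illustration only."""
    def fetch_page(start_key: str, count: int) -> List[Row]:
        start = keys.index(start_key) if start_key else 0
        return [(key, {}) for key in keys[start:start + count]]
    return fetch_page
```

Each loop iteration waits for a complete request/response cycle before the next one can be issued, which is where the latency mentioned above comes from.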
I think that, from a client perspective, it would be nicer in many scenarios just to "ask for all rows in a cf" and to receive some kind of stream and read the rows one by one from that stream, instead of receiving all rows and then iterating over them (and being limited by the count of rows). Of course client-side libraries could hide the paging stuff, but that would not improve latency.
Is something like this possible? Is it perhaps already implemented?


Greetings,
roland
--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.gude@yoochoose.com
WWW: http://www.yoochoose.com/

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln

Re: Streaming Row Ranges

Posted by Peter Schuller <pe...@infidyne.com>.

> I think that, from a client perspective it would be nicer in many scenarios
> just to “ask for all rows in a cf” and to receive some kind of stream and
> read the rows one by one from that stream instead of receiving all rows and
> then iterating over them (and being limited by the count of rows). Of course
> client side libraries could hide the paging stuff, but that would not
> improve latency.

Well, a high-level client could pre-fetch pages asynchronously such
that the latency issue goes away (given sufficient read-ahead).
Assuming a reasonably sized page size/count, hopefully the latency is
not huge relative to the time it takes to do the actual work. Further
performance (in terms of a single client, not overall throughput)
could be had by increasing concurrency (i.e., still doing read-ahead
of pages but pre-fetching multiple at the same time - within reason).
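A minimal sketch of the read-ahead idea, under stated assumptions: `fetch_page` is a hypothetical wrapper around the client's range-scan call (inclusive start key, empty key meaning "from the beginning"), and `make_fake_fetch` is an in-memory stand-in so the example runs without a cluster. The request for the next page is submitted as soon as the current page arrives, so it is in flight while the caller processes rows. Since each page's start key comes from the previous page, this sketch only reads one page ahead; fetching several pages at the same time would additionally require splitting the key range up front, which is not attempted here.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple[str, dict]  # (row key, columns)


def prefetching_scan(fetch_page: Callable[[str, int], List[Row]],
                     page_size: int = 100) -> Iterator[Row]:
    """Yield every row, requesting page N+1 while page N is being consumed."""
    if page_size < 2:
        # With an inclusive start key, a page of one row would never advance.
        raise ValueError("page_size must be at least 2")
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Assumption: an empty start key means "start of the range".
        pending: Optional[Future] = pool.submit(fetch_page, "", page_size)
        first_page = True
        while pending is not None:
            page = pending.result()
            if len(page) == page_size:
                # Start the next request before handing out any rows, so the
                # round trip overlaps with the caller's own work.
                pending = pool.submit(fetch_page, page[-1][0], page_size)
            else:
                pending = None  # a short page means the range is exhausted
            # Assumption: the start key is inclusive, so skip the repeated row.
            rows = page if first_page else page[1:]
            first_page = False
            for row in rows:
                yield row


def make_fake_fetch(keys: List[str]) -> Callable[[str, int], List[Row]]:
    """Hypothetical in-memory stand-in for a cluster, for illustration only."""
    def fetch_page(start_key: str, count: int) -> List[Row]:
        start = keys.index(start_key) if start_key else 0
        return [(key, {}) for key in keys[start:start + count]]
    return fetch_page
```

From the caller's side this is still a plain iterator, e.g. `for key, columns in prefetching_scan(fetch_page, page_size=1000): ...`, so the paging stays hidden while most of the per-page latency is overlapped with processing.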

Not saying that true streaming wouldn't be nice though.

> Is something like this possible? Is it perhaps already implemented?

Not implemented AFAIK; certainly possible though non-trivial (e.g.,
thrift doesn't directly support streaming so as long as thrift is
used, an underlying request/response oriented approach would be needed
anyway). I can't speak to what plans are, so leaving that for someone
else... But my personal feeling is that simply implementing pre-fetching
paging in higher-level clients seems easier to pull off than
orchestrating proper streaming support natively in Cassandra
internally and in its wire-level API. But maybe I'm being too paranoid
about the issues involved; if I'm way off maybe someone more familiar
with the code base will correct me.

-- 
/ Peter Schuller