Posted to user@cassandra.apache.org by Héctor Izquierdo Seliva <iz...@strands.com> on 2011/06/08 17:12:47 UTC

Retrieving a column from a fat row vs retrieving a single row

Hi,

I have an index I use to translate ids. I usually only read a column at
a time, and it's becoming a bottleneck. I could rewrite the application
to read a bunch at a time but it would make the application logic much
harder, as it would involve buffering incoming data.

As far as I know, to read a single column cassandra will deserialize a
bunch of them and then pick the correct one (64KB of data right?)

Would it be faster to have a row for each id I want to translate? This
would make keycache less effective, but the amount of data read should
be smaller.

Thanks!

Re: Retrieving a column from a fat row vs retrieving a single row

Posted by Héctor Izquierdo Seliva <iz...@strands.com>.

I think I will follow the advice of better balancing and I will split
the index into several pieces. Thanks everybody for your input!
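
One way to split a single fat index row into several pieces is to derive a bucket number from the id and make it part of the row key. The following is only a sketch of that idea: the bucket count, the "idx:" key prefix and the function name are made up for illustration, and the right number of buckets depends on the data and the cluster.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical; choose based on row size and cluster size


def index_row_key(source_id: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Return the row key of the index piece that holds source_id.

    A stable hash is used (not Python's built-in hash(), which is salted
    per process) so that every client maps the same id to the same row.
    """
    digest = hashlib.md5(source_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return "idx:%02d" % bucket


# The column name stays the id being translated; only the row key changes.
row_key = index_row_key("user-12345")
print(row_key)
```

A read then becomes "get column <id> from row index_row_key(<id>)", so the application still fetches one column at a time, but the requests are spread over many rows.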

Re: Retrieving a column from a fat row vs retrieving a single row

Posted by aaron morton <aa...@thelastpickle.com>.

Don't forget that reading at ONE may not mean that only 1 replica is involved in the request.

Any get or multiget (not range scan) read that runs with ReadRepair enabled will be sent to all UP replicas. If RR is disabled, it will only be sent to as many replicas as needed for the CL. For CL ONE the RR happens asynchronously to the request, which normally means after the initial request has returned.

So running with CL ONE *may* make the request return a little faster, but if RR is enabled you are still asking the cluster to do the same amount of work. And if the client(s) is running in a tight loop, reading at ONE makes it easier for the client to overload the server, as it can send more work before the server has really finished the previous request.

You can adjust the probability the RR will run in the yaml config. 
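
As a rough illustration of the point above, the amount of work per read can be modelled as the number of replicas that receive the request. This is a simplified sketch, not Cassandra code: the function names are hypothetical, all replicas are assumed to be up, and read_repair_chance stands for the RR probability mentioned above.

```python
def replicas_needed(cl: str, rf: int) -> int:
    """Replicas that must answer before the read returns (simplified)."""
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        return rf // 2 + 1
    if cl == "ALL":
        return rf
    raise ValueError("unsupported consistency level: %s" % cl)


def expected_replicas_contacted(cl: str, rf: int,
                                read_repair_chance: float) -> float:
    """Average number of replicas doing work per read.

    With probability read_repair_chance the read goes to every replica;
    otherwise it goes only to as many as the consistency level needs.
    """
    needed = replicas_needed(cl, rf)
    return read_repair_chance * rf + (1.0 - read_repair_chance) * needed


# RF=3, CL ONE, RR always on: all three replicas still do the read.
print(expected_replicas_contacted("ONE", 3, 1.0))  # 3.0
# Lowering the RR probability is what actually reduces the work.
print(expected_replicas_contacted("ONE", 3, 0.1))
```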

Hope that helps.  

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Jun 2011, at 00:37, Richard Low wrote:

> 2011/6/9 Héctor Izquierdo Seliva <iz...@strands.com>:
> 
>> Yeah, but if I have RF=3 then there are three nodes that can answer the
>> request right?
> 
> Yes, if you're happy to read at ConsistencyLevel.ONE.

Re: Retrieving a column from a fat row vs retrieving a single row

Posted by Richard Low <rl...@acunu.com>.

2011/6/9 Héctor Izquierdo Seliva <iz...@strands.com>:

> Yeah, but if I have RF=3 then there are three nodes that can answer the
> request right?

Yes, if you're happy to read at ConsistencyLevel.ONE.

Re: Retrieving a column from a fat row vs retrieving a single row

Posted by Héctor Izquierdo Seliva <iz...@strands.com>.

On Thu, 09-06-2011 at 13:28 +0200, Richard Low wrote:
> Remember also that partitioning is done by rows, not columns.  So
> large rows are stored on a single host.  This means they can't be load
> balanced and also all requests to that row will hit one host.  Having
> separate rows will allow load balancing of I/Os.
> 

Yeah, but if I have RF=3 then there are three nodes that can answer the
request right?

Re: Retrieving a column from a fat row vs retrieving a single row

Posted by Richard Low <rl...@acunu.com>.

Remember also that partitioning is done by rows, not columns.  So
large rows are stored on a single host.  This means they can't be load
balanced and also all requests to that row will hit one host.  Having
separate rows will allow load balancing of I/Os.
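
To make the difference concrete, here is a toy model of row-based partitioning. It is only a sketch under simplifying assumptions: the node names are hypothetical, the token is reduced modulo the node count instead of being matched against token ranges as a real ring does, and replication is ignored.

```python
import hashlib
from collections import Counter

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical 4-node ring


def owner(row_key: str) -> str:
    """Map a row key to one node via an MD5-derived token (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    token = int.from_bytes(digest, "big")
    return NODES[token % len(NODES)]


# One fat row: every column read uses the same row key, hence the same node.
fat_row_reads = Counter(owner("id_index") for _ in range(1000))

# One row per id: the reads spread over the ring.
per_id_reads = Counter(owner("id_%d" % i) for i in range(1000))

print(fat_row_reads)
print(per_id_reads)
```

The first counter has a single entry, while the second has one entry per node with roughly equal counts.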

-- 
Richard Low
Acunu | http://www.acunu.com | @acunu

Re: Retrieving a column from a fat row vs retrieving a single row

Posted by aaron morton <aa...@thelastpickle.com>.

Just to make things less clear, if you have one row that you are continually writing to, it may end up spread out over several SSTables. Compaction helps here by reducing the number of files that must be accessed, as long as it can keep up. But if you want to read column X and the row is fragmented over 5 SSTables, then each one must be accessed.

https://issues.apache.org/jira/browse/CASSANDRA-2319 is open to try and reduce the number of seeks.

For now take a look at nodetool cfhistograms to see how many sstables are read for your queries. 
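
The effect of fragmentation on a single-column read can be sketched with a toy model. This is not how Cassandra is implemented; SSTables are modelled as plain dictionaries, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Each SSTable is modelled as {row_key: {column_name: (value, timestamp)}}.
SSTable = Dict[str, Dict[str, Tuple[str, int]]]


def read_column(sstables: List[SSTable], row_key: str,
                column: str) -> Tuple[Optional[str], int]:
    """Return (value, sstables_touched) for one column of one row.

    Every SSTable containing the row has to be consulted, because any of
    them may hold the newest version of the column.
    """
    newest: Optional[Tuple[str, int]] = None
    touched = 0
    for table in sstables:
        row = table.get(row_key)
        if row is None:
            continue  # stands in for a bloom filter saying "not here"
        touched += 1
        cell = row.get(column)
        if cell is not None and (newest is None or cell[1] > newest[1]):
            newest = cell
    return (newest[0] if newest else None), touched


sstables = [
    {"index": {"a": ("1", 10)}},
    {"index": {"x": ("old", 20)}},
    {"other": {"x": ("unrelated", 99)}},
    {"index": {"x": ("new", 30)}},
]
print(read_column(sstables, "index", "x"))  # ('new', 3)
```

After compaction merges the three fragments of the "index" row into one file, the same read would touch a single SSTable.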

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

Re: Retrieving a column from a fat row vs retrieving a single row

Posted by Peter Schuller <pe...@infidyne.com>.

> As far as I know, to read a single column cassandra will deserialize a
> bunch of them and then pick the correct one (64KB of data right?)

Assuming the default setting of 64kb, the average amount deserialized
given random column access should be 8 kb (not true with row cache,
but with large rows presumably you don't have row cache).
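
The column index behind that setting can be pictured as one index entry per block of serialized columns: a read first finds the right block, then deserializes columns within it until it reaches the one requested. The sketch below is a simplified model only. The block size is counted in columns rather than bytes, and the function names are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple

Column = Tuple[str, str]  # (name, value), sorted by name within the row


def build_index(columns: List[Column], block_size: int) -> List[str]:
    """First column name of each block; block_size stands in for 64 KB."""
    return [columns[i][0] for i in range(0, len(columns), block_size)]


def get_column(columns: List[Column], index: List[str], block_size: int,
               name: str) -> Tuple[Optional[str], int]:
    """Return (value, columns_deserialized) for a single-column read."""
    block = bisect.bisect_right(index, name) - 1
    if block < 0:
        return None, 0  # name sorts before the first column of the row
    start = block * block_size
    scanned = 0
    for col_name, value in columns[start:start + block_size]:
        scanned += 1  # each column in the block is deserialized in turn
        if col_name == name:
            return value, scanned
    return None, scanned


cols = [("id%04d" % i, "v%d" % i) for i in range(1000)]
idx = build_index(cols, block_size=100)
print(get_column(cols, idx, 100, "id0250"))  # ('v250', 51)
```

In this model the work per read is bounded by the block size, not by the total size of the row.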

> Would it be faster to have a row for each id I want to translate? This
> would make keycache less effective, but the amount of data read should
> be smaller.

It depends on what bottlenecks you're optimizing for. A key is
"expensive" in the sense that it (1) increases the size of bloom
filters for the column family, (2) increases the memory cost of
index sampling, and (3) increases the total data size (typically)
because the row size is duplicated in both the index and data files.

The cost of deserializing the same data repeatedly is CPU. So if
you're nowhere near bottlenecking on disk and the memory trade-off is
reasonable, it may be a suitable optimization. However, consider that
unless you're doing order preserving partitioning, accessing those
rows will be effectively random w.r.t. the locations on disk you're
reading from so you're adding a lot of overhead in terms of disk I/O
unless your data set fits comfortably in memory.

-- 
/ Peter Schuller