Posted to user@cassandra.apache.org by Matthias Broecheler <me...@matthiasb.com> on 2012/10/15 22:42:59 UTC

RF update

Hey,

We are writing a lot of data into a Cassandra cluster for a batch loading
use case. We cannot use the sstable batch loader, so in order to speed up
the loading process we are using RF=1 while the data is loading. After the
load is complete, we want to increase the RF. For that, we update the
RF in the schema and then run the node repair tool on each Cassandra
instance to stream the data over. However, we are noticing that this
process is slowed down by a lot of compactions (the actual streaming of
data only takes a couple of minutes).

Cassandra is already running a major compaction after the data loading
process has completed. But then there appear to be two more compactions (one
on the sender and one on the receiver), and those take a very long
time even on the AWS high I/O instance with no compaction throttling.

Question: These additional compactions seem redundant since there are no
reads or writes on the cluster after the first major compaction
(immediately after the data load), is that right? And if so, what can we do
to avoid them? We are currently waiting multiple days.

Thank you very much for your help,
Matthias

Re: how to get column type?

Posted by aaron morton <aa...@thelastpickle.com>.

It depends if C* knows about the column. 

You can check this by looking at the schema in cassandra-cli, see the online help for show schema. 

AFAIK most higher level libraries will get the data type info via the API. They use this to determine the wire format for serialisation. Crack open the client for your chosen language and take a look if you need programmatic access.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 19/10/2012, at 1:40 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

> This is specifically why Cassandra and even PlayOrm are going in the
> direction of "partial schemas".  Everything in Cassandra in raw form is
> just bytes.  If you don't tell it the types, it doesn't know how to
> translate them.  PlayOrm and other ORM layers are the same way, though in
> these NoSQL ORMs you typically have a schema that works sort of like this:
> 
> // map each known column name to its Java type
> if (colName.equals("name"))
>   return String.class;
> else if (colName.equals("age"))
>   return Integer.class;
> 
> So column values are typed such that a command line tool like PlayOrm's
> command line tool can query and know how to translate the results.  Any
> parts of the schema that are not known are just returned in hex.
> 
> So schemaless is cool, but sometimes it is a big pain as well.
> 
> Dean
> 
> On 10/18/12 6:24 AM, "Hagos, A.S." <A....@tue.nl> wrote:
> 
>> Hi all,
>> I am wondering if there is a way to know the column type of an already
>> stored value in  Cassandra.
>> My specific case is to get a column value of a known column name but not
>> type.
>> 
>> greetings 
>> Ambes
>

Re: how to get column type?

Posted by "Hiller, Dean" <De...@nrel.gov>.

This is specifically why Cassandra and even PlayOrm are going in the
direction of "partial schemas".  Everything in Cassandra in raw form is
just bytes.  If you don't tell it the types, it doesn't know how to
translate them.  PlayOrm and other ORM layers are the same way, though in
these NoSQL ORMs you typically have a schema that works sort of like this:

// map each known column name to its Java type
if (colName.equals("name"))
   return String.class;
else if (colName.equals("age"))
   return Integer.class;

So column values are typed such that a tool like PlayOrm's command line
tool can query and know how to translate the results.  Any parts of the
schema that are not known are just returned in hex.
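
To make the "partial schema" idea concrete, here is a minimal sketch in plain
Java. It is not PlayOrm's or Cassandra's actual API: the class name
PartialSchema and its methods declare, decode and toHex are made up for
illustration, and it assumes strings are UTF-8 and integers are 4-byte
big-endian. Known columns are decoded by their declared type, everything else
falls back to hex.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a "partial schema"; not a real library API.
final class PartialSchema {
    // Column name -> declared Java type. Columns missing from the map are "unknown".
    private final Map<String, Class<?>> types = new HashMap<>();

    PartialSchema declare(String colName, Class<?> type) {
        types.put(colName, type);
        return this;
    }

    // Decode raw bytes using the declared type, or fall back to a hex string.
    Object decode(String colName, byte[] raw) {
        Class<?> type = types.get(colName);
        if (type == String.class) {
            return new String(raw, StandardCharsets.UTF_8); // assumes UTF-8 text
        }
        if (type == Integer.class && raw.length == 4) {
            return ByteBuffer.wrap(raw).getInt(); // assumes big-endian 32-bit int
        }
        return toHex(raw); // unknown column, or bytes that do not fit the type
    }

    static String toHex(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PartialSchema schema = new PartialSchema()
                .declare("name", String.class)
                .declare("age", Integer.class);
        byte[] age = ByteBuffer.allocate(4).putInt(42).array();
        System.out.println(schema.decode("name", "Ambes".getBytes(StandardCharsets.UTF_8)));
        System.out.println(schema.decode("age", age));
        System.out.println(schema.decode("unknown", age)); // same bytes, shown as hex
    }
}
```

The same four bytes come back as the number 42 for the declared "age" column
and as the hex string 0000002a for a column the schema does not know about.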

So schemaless is cool, but sometimes it is a big pain as well.

Dean

On 10/18/12 6:24 AM, "Hagos, A.S." <A....@tue.nl> wrote:

>Hi all,
>I am wondering if there is a way to know the column type of an already
>stored value in  Cassandra.
>My specific case is to get a column value of a known column name but not
>type.
>
>greetings 
>Ambes

how to get column type?

Posted by "Hagos, A.S." <A....@tue.nl>.

Hi all,
I am wondering if there is a way to know the column type of an already stored value in Cassandra.
My specific case is to get a column value of a known column name but not type.

greetings 
Ambes

Re: RF update

Posted by aaron morton <aa...@thelastpickle.com>.

> Follow up question: Is it safe to abort the compactions happening after node repair?
It is always safe to abort a compaction. The purpose of compaction is to replicate the current truth in a more compact format. It does not modify data, it just creates new files. The worst case would be killing it between the time the new files are marked as non-temporary and the time the old files are deleted. That would result in wasted disk space, but the truth in the system would not change.
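
A toy model in Java may help show why the truth does not change. This is only
a sketch under heavy simplifications, not how Cassandra implements SSTables or
compaction: the names CompactionSketch, Cell, read and compact are invented,
each "file" is an in-memory map, and tombstones and TTLs are ignored. Reads
pick the newest cell across all live files, and compaction only builds a new
file from the old ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; real SSTables and compaction are far more involved.
final class CompactionSketch {
    // A cell is a value plus the write timestamp used to resolve conflicts.
    static final class Cell {
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Read path: the newest cell for a key across all live files wins.
    static String read(List<Map<String, Cell>> sstables, String key) {
        Cell newest = null;
        for (Map<String, Cell> table : sstables) {
            Cell c = table.get(key);
            if (c != null && (newest == null || c.timestamp > newest.timestamp)) {
                newest = c;
            }
        }
        return newest == null ? null : newest.value;
    }

    // Compaction: build a NEW file with the newest cell per key; inputs are not modified.
    static Map<String, Cell> compact(List<Map<String, Cell>> inputs) {
        Map<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> table : inputs) {
            for (Map.Entry<String, Cell> e : table.entrySet()) {
                Cell existing = merged.get(e.getKey());
                if (existing == null || e.getValue().timestamp > existing.timestamp) {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Cell> older = new TreeMap<>();
        older.put("k1", new Cell("v1", 1));
        older.put("k2", new Cell("old", 1));
        Map<String, Cell> newer = new TreeMap<>();
        newer.put("k2", new Cell("new", 2));

        List<Map<String, Cell>> live = new ArrayList<>();
        live.add(older);
        live.add(newer);
        String before = read(live, "k2");

        // Interrupted state: the compacted file exists, the old files were not deleted yet.
        Map<String, Cell> compacted = compact(live);
        live.add(compacted);
        String during = read(live, "k2");

        // Completed state: only the compacted file remains.
        List<Map<String, Cell>> done = new ArrayList<>();
        done.add(compacted);
        String after = read(done, "k2");

        System.out.println(before + " " + during + " " + after);
    }
}
```

In this model a read returns the same value before compaction, in the
interrupted state where old and new files coexist, and after the old files are
gone. The only cost of the interrupted state is the extra copy on disk.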

> 
> > Question: These additional compactions seem redundant since there are no reads or writes on the cluster after the first major compaction (immediately after the data load), is that right?

Repair transfers a portion of the -Data.db component from potentially multiple SSTables. This may result in multiple new SSTables being created on the receiving node. Once the files are created, they are processed in a similar way to a freshly flushed memtable, and so compaction kicks in.

> And if so, what can we do to avoid them? We are currently waiting multiple days.

The fact that compaction is taking so long is odd. Have you checked the logs for GC problems? If you are running an SSD-backed instance and have turned off compaction throttling, the high IO throughput can result in mucho garbage. Faster is not always better.

To improve your situation consider:

* disabling compaction by setting min_compaction_threshold and max_compaction_threshold to 0 via schema or nodetool
* disabling durable_writes to disable the commit log during the bulk load. 
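
For the nodetool route, a small Java helper could assemble the command line
before and after the load. This is a sketch, not a tested procedure: the host
127.0.0.1 and the names BulkKs and BulkCf are placeholders, and the class
NodetoolCommands is made up. It only prints the commands, it does not run
them. The restore values 4 and 32 are the default thresholds. durable_writes
is a keyspace-level schema setting and is not covered here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that builds (but does not execute) nodetool command lines.
final class NodetoolCommands {
    // Arguments for: nodetool -h <host> setcompactionthreshold <keyspace> <cf> <min> <max>
    static List<String> setCompactionThreshold(String host, String keyspace,
                                               String columnFamily, int min, int max) {
        return Arrays.asList("nodetool", "-h", host, "setcompactionthreshold",
                keyspace, columnFamily, Integer.toString(min), Integer.toString(max));
    }

    public static void main(String[] args) {
        String host = "127.0.0.1";       // placeholder node address
        String keyspace = "BulkKs";      // placeholder keyspace name
        String columnFamily = "BulkCf";  // placeholder column family name

        // Before the bulk load: min and max of 0 disable minor compaction.
        System.out.println(String.join(" ",
                setCompactionThreshold(host, keyspace, columnFamily, 0, 0)));

        // After load and repair: restore the default thresholds (4 and 32).
        System.out.println(String.join(" ",
                setCompactionThreshold(host, keyspace, columnFamily, 4, 32)));
    }
}
```

The threshold change made through nodetool applies to the node it is run
against, so the command would need to be repeated for each node in the cluster.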

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/10/2012, at 11:55 PM, Matthias Broecheler <me...@matthiasb.com> wrote:

> Follow up question: Is it safe to abort the compactions happening after node repair?
> 
> On Mon, Oct 15, 2012 at 6:32 PM, Will Martin <wi...@voodoolunchbox.com> wrote:
> +1   It doesn't make sense that the xfr compactions are heavy unless they are translating the file. This could be a protocol mismatch: however the requirements for node level compaction and wire compaction I would expect to be pretty different.
> On Oct 15, 2012, at 4:42 PM, Matthias Broecheler wrote:
> 
> > Hey,
> >
> > We are writing a lot of data into a Cassandra cluster for a batch loading use case. We cannot use the sstable batch loader, so in order to speed up the loading process we are using RF=1 while the data is loading. After the load is complete, we want to increase the RF. For that, we update the RF in the schema and then run the node repair tool on each Cassandra instance to stream the data over. However, we are noticing that this process is slowed down by a lot of compactions (the actual streaming of data only takes a couple of minutes).
> >
> > Cassandra is already running a major compaction after the data loading process has completed. But then there appear to be two more compactions (one on the sender and one on the receiver), and those take a very long time even on the AWS high I/O instance with no compaction throttling.
> >
> > Question: These additional compactions seem redundant since there are no reads or writes on the cluster after the first major compaction (immediately after the data load), is that right? And if so, what can we do to avoid them? We are currently waiting multiple days.
> >
> > Thank you very much for your help,
> > Matthias
> >
> 
> 
> 
> 
> -- 
> Matthias Broecheler, PhD
> http://www.matthiasb.com
> E-Mail: me@matthiasb.com

Re: RF update

Posted by Matthias Broecheler <me...@matthiasb.com>.

Follow up question: Is it safe to abort the compactions happening after
node repair?

On Mon, Oct 15, 2012 at 6:32 PM, Will Martin <wi...@voodoolunchbox.com> wrote:

> +1   It doesn't make sense that the xfr compactions are heavy unless they
> are translating the file. This could be a protocol mismatch: however the
> requirements for node level compaction and wire compaction I would expect
> to be pretty different.
> On Oct 15, 2012, at 4:42 PM, Matthias Broecheler wrote:
>
> > Hey,
> >
> > we are writing a lot of data into a cassandra cluster for a batch
> loading use case. We cannot use the sstable batch loader, so in order to
> speed up the loading process we are using RF=1 while the data is loading.
> After the load is complete, we want to increase the RF. For that, we are
> updating the RF in the schema and then run the node repair tool on each
> cassandra instance to stream the data over. However, we are noticing that
> this process is slowed down by a lot of compactions (the actual streaming
> of data only takes a couple of minutes).
> >
> > Cassandra is already running a major compaction after the data loading
> process has completed. But then, there appear to be two more compactions (one
> on the sender and one on the receiver) happening and those take a very long
> time even on the aws high i/o instance with no compaction throttling.
> >
> > Question: These additional compactions seem redundant since there are no
> reads or writes on the cluster after the first major compaction
> (immediately after the data load), is that right? And if so, what can we do
> to avoid them? We are currently waiting multiple days.
> >
> > Thank you very much for your help,
> > Matthias
> >
>
>


-- 
Matthias Broecheler, PhD
http://www.matthiasb.com
E-Mail: me@matthiasb.com

Re: RF update

Posted by Will Martin <wi...@voodoolunchbox.com>.

+1   It doesn't make sense that the xfr compactions are heavy unless they are translating the file. This could be a protocol mismatch; however, I would expect the requirements for node level compaction and wire compaction to be pretty different.
On Oct 15, 2012, at 4:42 PM, Matthias Broecheler wrote:

> Hey,
> 
> We are writing a lot of data into a Cassandra cluster for a batch loading use case. We cannot use the sstable batch loader, so in order to speed up the loading process we are using RF=1 while the data is loading. After the load is complete, we want to increase the RF. For that, we update the RF in the schema and then run the node repair tool on each Cassandra instance to stream the data over. However, we are noticing that this process is slowed down by a lot of compactions (the actual streaming of data only takes a couple of minutes).
> 
> Cassandra is already running a major compaction after the data loading process has completed. But then there appear to be two more compactions (one on the sender and one on the receiver), and those take a very long time even on the AWS high I/O instance with no compaction throttling.
> 
> Question: These additional compactions seem redundant since there are no reads or writes on the cluster after the first major compaction (immediately after the data load), is that right? And if so, what can we do to avoid them? We are currently waiting multiple days.
> 
> Thank you very much for your help,
> Matthias
>