Posted to user@cassandra.apache.org by Aaron Turner <sy...@gmail.com> on 2012/08/25 02:56:07 UTC

optimizing use of sstableloader / SSTableSimpleUnsortedWriter

So I've read: http://www.datastax.com/dev/blog/bulk-loading

Are there any tips for using sstableloader /
SSTableSimpleUnsortedWriter to migrate time series data from our old
datastore (PostgreSQL) to Cassandra?  After thinking about how
sstables are done on disk, it seems best (required??) to write out
each row at once.  I.e.: if each row == one year's worth of data and
you have, say, 30,000 rows, write one full row at a time (a full
year's worth of data points for a given metric) rather than one data
point at a time across 30,000 rows.
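
Something like this is what I had in mind, adapted from the example in
that blog post (the Metrics/TimeSeries schema and the
pointsFromPostgres() helper are made up for illustration, and the
constructor signatures vary a bit between Cassandra versions, so treat
it as a sketch):

import java.io.File;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.db.marshal.LongType;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class TimeSeriesBulkWriter {

    /** One (timestamp, value) sample; stands in for a row from PostgreSQL. */
    static final class Point {
        final long epochMillis;
        final double value;
        Point(long t, double v) { epochMillis = t; value = v; }
    }

    public static void main(String[] args) throws Exception {
        // sstableloader wants a <Keyspace>/<ColumnFamily> directory layout
        // (at least on 1.1; older versions differ).
        File dir = new File("/tmp/bulk/Metrics/TimeSeries");
        dir.mkdirs();

        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                dir,
                "Metrics",           // keyspace (made up)
                "TimeSeries",        // column family (made up)
                LongType.instance,   // column names are timestamps
                null,                // no super columns
                64);                 // bufferSizeInMB: flush an sstable per ~64MB

        // One metric == one row. Write all of a metric's points together so
        // the row lands in as few flushed sstables as possible (ideally one).
        writer.newRow(bytes("cpu.load.host42"));
        for (Point p : pointsFromPostgres("cpu.load.host42"))
            writer.addColumn(bytes(p.epochMillis), bytes(p.value), p.epochMillis);

        writer.close(); // flushes the final buffer
    }

    // Placeholder: in reality, one SELECT per metric against PostgreSQL.
    static List<Point> pointsFromPostgres(String metric) {
        return Collections.emptyList();
    }
}

Then I'd point sstableloader at that directory to stream the sstables
into the cluster.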

Any other tips to improve load time or reduce the load on the cluster
or subsequent compaction activity?  All the CFs I'll be writing to
use compression and leveled compaction.

Right now my Cassandra data store has about 4 months of data and we
have 5 years of historical data (not sure how much we'll actually
load, but at least one year's worth).

Thanks!

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Re: optimizing use of sstableloader / SSTableSimpleUnsortedWriter

Posted by aaron morton <aa...@thelastpickle.com>.
> dataset... just under 4 months of data is less than 2GB!  I'm pretty
> thrilled.
Be thrilled by all the compressions! :)

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 28/08/2012, at 6:10 AM, Aaron Turner <sy...@gmail.com> wrote:

> On Mon, Aug 27, 2012 at 1:19 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> After thinking about how
>> sstables are done on disk, it seems best (required??) to write out
>> each row at once.
>> 
>> Sort of. We only want one instance of the row per SSTable created.
> 
> Ah, good clarification, although I think for my purposes they're one
> and the same.
> 
> 
>> Any other tips to improve load time or reduce the load on the cluster
>> or subsequent compaction activity?
>> 
>> Fewer SSTables mean less compaction. So go as high as you can on the
>> bufferSizeInMB param for the
>> SSTableSimpleUnsortedWriter.
> 
> Ok.
> 
>> There is also an SSTableSimpleWriter. Because it expects rows to be
>> ordered, it does not buffer and can create bigger sstables.
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableSimpleWriter.java
> 
> Hmmm.... probably not realistic in my situation... doing so would
> likely thrash the disks on my PG server a lot more and kill my read
> throughput, and that server is already hitting a wall.
> 
>> 
>> Right now my Cassandra data store has about 4 months of data and we
>> have 5 years of historical
>> 
>> ingest all the histories!
> 
> Actually, I was a little worried about how much space that would
> take... my estimate was ~305GB/year, which is a lot when you consider
> the 300-400GB/node limit (something I didn't know about at the time).
> However, compression has turned out to be extremely efficient on my
> dataset... just under 4 months of data is less than 2GB!  I'm pretty
> thrilled.
> 
> 
> -- 
> Aaron Turner
> http://synfin.net/         Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>    -- Benjamin Franklin
> "carpe diem quam minimum credula postero"


Re: optimizing use of sstableloader / SSTableSimpleUnsortedWriter

Posted by Aaron Turner <sy...@gmail.com>.
On Mon, Aug 27, 2012 at 1:19 AM, aaron morton <aa...@thelastpickle.com> wrote:
> After thinking about how
> sstables are done on disk, it seems best (required??) to write out
> each row at once.
>
> Sort of. We only want one instance of the row per SSTable created.

Ah, good clarification, although I think for my purposes they're one
and the same.


> Any other tips to improve load time or reduce the load on the cluster
> or subsequent compaction activity?
>
> Fewer SSTables mean less compaction. So go as high as you can on the
> bufferSizeInMB param for the
> SSTableSimpleUnsortedWriter.

Ok.

> There is also an SSTableSimpleWriter. Because it expects rows to be
> ordered, it does not buffer and can create bigger sstables.
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableSimpleWriter.java

Hmmm.... probably not realistic in my situation... doing so would
likely thrash the disks on my PG server a lot more and kill my read
throughput, and that server is already hitting a wall.

>
> Right now my Cassandra data store has about 4 months of data and we
> have 5 years of historical
>
> ingest all the histories!

Actually, I was a little worried about how much space that would
take... my estimate was ~305GB/year, which is a lot when you consider
the 300-400GB/node limit (something I didn't know about at the time).
However, compression has turned out to be extremely efficient on my
dataset... just under 4 months of data is less than 2GB!  I'm pretty
thrilled.


-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Re: optimizing use of sstableloader / SSTableSimpleUnsortedWriter

Posted by aaron morton <aa...@thelastpickle.com>.
> After thinking about how
> sstables are done on disk, it seems best (required??) to write out
> each row at once.  
Sort of. We only want one instance of the row per SSTable created. 


> Any other tips to improve load time or reduce the load on the cluster
> or subsequent compaction activity? 

Fewer SSTables mean less compaction. So go as high as you can on the bufferSizeInMB param for the
SSTableSimpleUnsortedWriter. 

There is also an SSTableSimpleWriter. Because it expects rows to be ordered, it does not buffer and can create bigger sstables.
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableSimpleWriter.java
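
A minimal sketch of the difference, reusing the made-up
Metrics/TimeSeries schema from the sketch in the original post (again,
constructor signatures vary between versions, so check the class
linked above):

import java.io.File;
import org.apache.cassandra.db.marshal.LongType;
import org.apache.cassandra.io.sstable.SSTableSimpleWriter;
import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class SortedBulkWriter {
    public static void main(String[] args) throws Exception {
        SSTableSimpleWriter writer = new SSTableSimpleWriter(
                new File("/tmp/bulk/Metrics/TimeSeries"),
                "Metrics",           // keyspace (made up)
                "TimeSeries",        // column family (made up)
                LongType.instance,   // column names are timestamps
                null);               // no super columns

        // No bufferSizeInMB here: nothing is buffered, so rows must be
        // added in partitioner (token) order rather than any order you
        // like. In exchange it appends to one big sstable instead of
        // flushing many buffer-sized ones.
        writer.newRow(bytes("cpu.load.host42"));
        writer.addColumn(bytes(1346025600000L), bytes(0.42), 1346025600000L);
        writer.close();
    }
}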


> Right now my Cassandra data store has about 4 months of data and we
> have 5 years of historical 
ingest all the histories!

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/08/2012, at 12:56 PM, Aaron Turner <sy...@gmail.com> wrote:

> So I've read: http://www.datastax.com/dev/blog/bulk-loading
> 
> Are there any tips for using sstableloader /
> SSTableSimpleUnsortedWriter to migrate time series data from our old
> datastore (PostgreSQL) to Cassandra?  After thinking about how
> sstables are done on disk, it seems best (required??) to write out
> each row at once.  I.e.: if each row == one year's worth of data and
> you have, say, 30,000 rows, write one full row at a time (a full
> year's worth of data points for a given metric) rather than one data
> point at a time across 30,000 rows.
> 
> Any other tips to improve load time or reduce the load on the cluster
> or subsequent compaction activity?  All the CFs I'll be writing to
> use compression and leveled compaction.
> 
> Right now my Cassandra data store has about 4 months of data and we
> have 5 years of historical data (not sure how much we'll actually
> load, but at least one year's worth).
> 
> Thanks!
> 
> -- 
> Aaron Turner
> http://synfin.net/         Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>    -- Benjamin Franklin
> "carpe diem quam minimum credula postero"