Posted to user@cassandra.apache.org by DuyHai Doan <do...@gmail.com> on 2019/09/28 20:22:47 UTC

Cluster sizing for huge dataset

Hello users

I'm facing a very challenging exercise: sizing a cluster for a huge dataset.

Use-case = IoT

Number of sensors: 30 million
Frequency of data: every 10 minutes
Estimated size of a data point: 100 bytes (including clustering columns)
Data retention: 2 years
Replication factor: 3 (pretty standard)

A very quick math gives me:

6 data points/hour * 24 * 365 ≈ 50,000 data points/year/sensor

In terms of size, that is 50,000 x 100 bytes ≈ 5 MB worth of data/year/sensor

Now the big problem is that we have 30 million sensors, so the disk
requirement adds up pretty fast: 5 MB * 30,000,000 sensors = 150 TB
worth of data/year

We want to store data for 2 years => 300 TB

We have RF=3 ==> 900 TB !!!!

Now, according to the commonly recommended density (with SSD), one shall
not exceed 2 TB of data per node, which gives us a rough sizing of a
450-node cluster !!!

Even if we push the limit up to 10 TB per node using TWCS (has anyone
tried this?), we would still need 90 beefy nodes to support this.
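For reference, here is the same arithmetic as a small Python sketch, using
only the figures above and assuming no compression:

    # Back-of-the-envelope sizing with the figures from this thread (no compression)
    sensors = 30_000_000
    points_per_year = 6 * 24 * 365          # one point every 10 minutes ~= 52,560
    bytes_per_point = 100
    retention_years = 2
    rf = 3

    per_sensor_year = points_per_year * bytes_per_point       # ~5 MB
    raw_per_year = per_sensor_year * sensors                   # ~150 TB
    total = raw_per_year * retention_years * rf                 # ~900 TB with RF=3

    for density_tb in (2, 10):
        print(f"{density_tb} TB/node -> ~{total / (density_tb * 1e12):.0f} nodes")
    # 2 TB/node -> ~473 nodes, 10 TB/node -> ~95 nodes (the ~450 / ~90 figures above)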

Any thoughts/ideas to reduce the node count or increase the density while
keeping the cluster manageable?

Regards

Duy Hai DOAN



Re: Cluster sizing for huge dataset

Posted by DuyHai Doan <do...@gmail.com>.
The problem is that the user also wants to access old data using CQL, not
fire up SparkSQL just to fetch one or two old records

On Oct 4, 2019 at 12:38, "Cedrick Lunven" <ce...@datastax.com>
wrote:

> Hi,
>
> If you are using DataStax Enterprise why not offloading cold data to DSEFS
> (HDFS implementation) with friendly analytics storage format like parquet,
> keep only OLTP in the Cassandra Tables. Recommended size for DSEFS can go
> up to 30TB a node.
>
> I am pretty sure you are already aware of this option and would be curious
> to get your think about this solution and limitations.
>
> Note: that would also probably help you with your init-load/TWCS issue .
>
> My2c.
> Cedrick
>
> On Tue, Oct 1, 2019 at 11:49 PM DuyHai Doan <do...@gmail.com> wrote:
>
>> The client wants to be able to access cold data (2 years old) in the
>> same cluster so moving data to another system is not possible
>>
>> However, since we're using Datastax Enterprise, we can leverage Tiered
>> Storage and store old data on Spinning Disks to save on hardware
>>
>> Regards
>>
>> On Tue, Oct 1, 2019 at 9:47 AM Julien Laurenceau
>> <ju...@pepitedata.com> wrote:
>> >
>> > Hi,
>> > Depending on the use case, you may also consider storage tiering with
>> fresh data on hot-tier (Cassandra) and older data on cold-tier
>> (Spark/Parquet or Presto/Parquet). It would be a lot more complex, but may
>> fit more appropriately the budget and you may reuse some tech already
>> present in your environment.
>> > You may even do subsampling during the transformation offloading data
>> from Cassandra in order to keep one point out of 10 for older data if
>> subsampling makes sense for your data signal.
>> >
>> > Regards
>> > Julien
>> >
>> > Le lun. 30 sept. 2019 à 22:03, DuyHai Doan <do...@gmail.com> a
>> écrit :
>> >>
>> >> Thanks all for your reply
>> >>
>> >> The target deployment is on Azure so with the Nice disk snapshot
>> feature, replacing a dead node is easier, no streaming from Cassandra
>> >>
>> >> About compaction overhead, using TwCs with a 1 day bucket and removing
>> read repair and subrange repair should be sufficient
>> >>
>> >> Now the only remaining issue is Quorum read which triggers repair
>> automagically
>> >>
>> >> Before 4.0  there is no flag to turn it off unfortunately
>> >>
>> >> Le 30 sept. 2019 15:47, "Eric Evans" <jo...@gmail.com> a
>> écrit :
>> >>
>> >> On Sat, Sep 28, 2019 at 8:50 PM Jeff Jirsa <jj...@gmail.com> wrote:
>> >>
>> >> [ ... ]
>> >>
>> >> > 2) The 2TB guidance is old and irrelevant for most people, what you
>> really care about is how fast you can replace the failed machine
>> >> >
>> >> > You’d likely be ok going significantly larger than that if you use a
>> few vnodes, since that’ll help rebuild faster (you’ll stream from more
>> sources on rebuild)
>> >> >
>> >> > If you don’t want to use vnodes, buy big machines and run multiple
>> Cassandra instances in it - it’s not hard to run 3-4TB per instance and
>> 12-16T of SSD per machine
>> >>
>> >> We do this too.  It's worth keeping in mind though that you'll still
>> >> have a 12-16T blast radius in the event of a host failure.  As the
>> >> host density goes up, consider steps to make the host more robust
>> >> (RAID, redundant power supplies, etc).
>> >>
>> >> --
>> >> Eric Evans
>> >> john.eric.evans@gmail.com
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>> >> For additional commands, e-mail: user-help@cassandra.apache.org
>> >>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>> For additional commands, e-mail: user-help@cassandra.apache.org
>>
>>
>
> --
>
>
> Cedrick Lunven | EMEA Developer Advocate Manager
>
>
> <https://www.linkedin.com/in/clunven/> <https://twitter.com/clunven>
> <https://clun.github.io/> <https://github.com/clun/>
>
>
> ❓Ask us your questions : *DataStax Community
> <https://community.datastax.com/index.html>*
>
> 🔬Test our new products : *DataStax Labs
> <https://downloads.datastax.com/#labs>*
>
>
>
> <https://constellation.datastax.com/?utm_campaign=FY20Q2_CONSTELLATION&utm_medium=email&utm_source=signature>
>
>
>

Re: Cluster sizing for huge dataset

Posted by Cedrick Lunven <ce...@datastax.com>.
Hi,

If you are using DataStax Enterprise, why not offload cold data to DSEFS
(an HDFS implementation) in an analytics-friendly storage format like
Parquet, and keep only the OLTP data in the Cassandra tables? The
recommended size for DSEFS can go up to 30TB a node.
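A rough sketch of what such an offload could look like with Spark (the
spark-cassandra-connector is assumed; keyspace, table and column names, the
12-month cutoff and the dsefs:// path are illustrative placeholders, not an
actual schema):

    # Copy rows older than ~12 months from Cassandra to Parquet on DSEFS
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cold-data-offload").getOrCreate()

    cold = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="iot", table="sensor_data")
            .load()
            .where("event_time < current_timestamp() - interval 12 months"))

    # Analytics-friendly layout: Parquet partitioned by day on a DSEFS/HDFS path
    (cold.withColumn("day", F.to_date("event_time"))
         .write.mode("append")
         .partitionBy("day")
         .parquet("dsefs:///cold/sensor_data/"))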

I am pretty sure you are already aware of this option and would be curious
to get your thoughts about this solution and its limitations.

Note: that would also probably help you with your init-load/TWCS issue.

My2c.
Cedrick

On Tue, Oct 1, 2019 at 11:49 PM DuyHai Doan <do...@gmail.com> wrote:

> The client wants to be able to access cold data (2 years old) in the
> same cluster so moving data to another system is not possible
>
> However, since we're using Datastax Enterprise, we can leverage Tiered
> Storage and store old data on Spinning Disks to save on hardware
>
> Regards
>
> On Tue, Oct 1, 2019 at 9:47 AM Julien Laurenceau
> <ju...@pepitedata.com> wrote:
> >
> > Hi,
> > Depending on the use case, you may also consider storage tiering with
> fresh data on hot-tier (Cassandra) and older data on cold-tier
> (Spark/Parquet or Presto/Parquet). It would be a lot more complex, but may
> fit more appropriately the budget and you may reuse some tech already
> present in your environment.
> > You may even do subsampling during the transformation offloading data
> from Cassandra in order to keep one point out of 10 for older data if
> subsampling makes sense for your data signal.
> >
> > Regards
> > Julien
> >
> > Le lun. 30 sept. 2019 à 22:03, DuyHai Doan <do...@gmail.com> a
> écrit :
> >>
> >> Thanks all for your reply
> >>
> >> The target deployment is on Azure so with the Nice disk snapshot
> feature, replacing a dead node is easier, no streaming from Cassandra
> >>
> >> About compaction overhead, using TwCs with a 1 day bucket and removing
> read repair and subrange repair should be sufficient
> >>
> >> Now the only remaining issue is Quorum read which triggers repair
> automagically
> >>
> >> Before 4.0  there is no flag to turn it off unfortunately
> >>
> >> Le 30 sept. 2019 15:47, "Eric Evans" <jo...@gmail.com> a
> écrit :
> >>
> >> On Sat, Sep 28, 2019 at 8:50 PM Jeff Jirsa <jj...@gmail.com> wrote:
> >>
> >> [ ... ]
> >>
> >> > 2) The 2TB guidance is old and irrelevant for most people, what you
> really care about is how fast you can replace the failed machine
> >> >
> >> > You’d likely be ok going significantly larger than that if you use a
> few vnodes, since that’ll help rebuild faster (you’ll stream from more
> sources on rebuild)
> >> >
> >> > If you don’t want to use vnodes, buy big machines and run multiple
> Cassandra instances in it - it’s not hard to run 3-4TB per instance and
> 12-16T of SSD per machine
> >>
> >> We do this too.  It's worth keeping in mind though that you'll still
> >> have a 12-16T blast radius in the event of a host failure.  As the
> >> host density goes up, consider steps to make the host more robust
> >> (RAID, redundant power supplies, etc).
> >>
> >> --
> >> Eric Evans
> >> john.eric.evans@gmail.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> >> For additional commands, e-mail: user-help@cassandra.apache.org
> >>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>

-- 


Cedrick Lunven | EMEA Developer Advocate Manager


<https://www.linkedin.com/in/clunven/> <https://twitter.com/clunven>
<https://clun.github.io/> <https://github.com/clun/>


❓Ask us your questions : *DataStax Community
<https://community.datastax.com/index.html>*

🔬Test our new products : *DataStax Labs
<https://downloads.datastax.com/#labs>*


<https://constellation.datastax.com/?utm_campaign=FY20Q2_CONSTELLATION&utm_medium=email&utm_source=signature>

Re: Cluster sizing for huge dataset

Posted by DuyHai Doan <do...@gmail.com>.
The client wants to be able to access cold data (up to 2 years old) in the
same cluster, so moving data to another system is not possible.

However, since we're using DataStax Enterprise, we can leverage Tiered
Storage and store old data on spinning disks to save on hardware.

Regards

On Tue, Oct 1, 2019 at 9:47 AM Julien Laurenceau
<ju...@pepitedata.com> wrote:
>
> Hi,
> Depending on the use case, you may also consider storage tiering with fresh data on hot-tier (Cassandra) and older data on cold-tier (Spark/Parquet or Presto/Parquet). It would be a lot more complex, but may fit more appropriately the budget and you may reuse some tech already present in your environment.
> You may even do subsampling during the transformation offloading data from Cassandra in order to keep one point out of 10 for older data if subsampling makes sense for your data signal.
>
> Regards
> Julien
>
> Le lun. 30 sept. 2019 à 22:03, DuyHai Doan <do...@gmail.com> a écrit :
>>
>> Thanks all for your reply
>>
>> The target deployment is on Azure so with the Nice disk snapshot feature, replacing a dead node is easier, no streaming from Cassandra
>>
>> About compaction overhead, using TwCs with a 1 day bucket and removing read repair and subrange repair should be sufficient
>>
>> Now the only remaining issue is Quorum read which triggers repair automagically
>>
>> Before 4.0  there is no flag to turn it off unfortunately
>>
>> Le 30 sept. 2019 15:47, "Eric Evans" <jo...@gmail.com> a écrit :
>>
>> On Sat, Sep 28, 2019 at 8:50 PM Jeff Jirsa <jj...@gmail.com> wrote:
>>
>> [ ... ]
>>
>> > 2) The 2TB guidance is old and irrelevant for most people, what you really care about is how fast you can replace the failed machine
>> >
>> > You’d likely be ok going significantly larger than that if you use a few vnodes, since that’ll help rebuild faster (you’ll stream from more sources on rebuild)
>> >
>> > If you don’t want to use vnodes, buy big machines and run multiple Cassandra instances in it - it’s not hard to run 3-4TB per instance and 12-16T of SSD per machine
>>
>> We do this too.  It's worth keeping in mind though that you'll still
>> have a 12-16T blast radius in the event of a host failure.  As the
>> host density goes up, consider steps to make the host more robust
>> (RAID, redundant power supplies, etc).
>>
>> --
>> Eric Evans
>> john.eric.evans@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>> For additional commands, e-mail: user-help@cassandra.apache.org
>>
>>



Re: Cluster sizing for huge dataset

Posted by Julien Laurenceau <ju...@pepitedata.com>.
Hi,
Depending on the use case, you may also consider storage tiering, with fresh
data on a hot tier (Cassandra) and older data on a cold tier (Spark/Parquet
or Presto/Parquet). It would be a lot more complex, but it may fit the budget
more appropriately, and you may be able to reuse some tech already present in
your environment.
You may even subsample while offloading data from Cassandra, keeping for
instance one point out of 10 for older data, if subsampling makes sense for
your data signal.
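
A hedged sketch of that subsampling step in Spark (table, column names and the
12-month cutoff are illustrative placeholders):

    # Keep ~1 point out of 10 per sensor for data older than 12 months
    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("downsample-old-data").getOrCreate()

    old = (spark.read
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="iot", table="sensor_data")
           .load()
           .where("event_time < current_timestamp() - interval 12 months"))

    w = Window.partitionBy("sensor_id").orderBy("event_time")
    sampled = (old.withColumn("rn", F.row_number().over(w))
                  .where(F.col("rn") % 10 == 1)     # every 10th point per sensor
                  .drop("rn"))

    sampled.write.mode("append").parquet("dsefs:///cold/sensor_data_downsampled/")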

Regards
Julien

On Mon, Sep 30, 2019 at 22:03, DuyHai Doan <do...@gmail.com> wrote:

> Thanks all for your reply
>
> The target deployment is on Azure so with the Nice disk snapshot feature,
> replacing a dead node is easier, no streaming from Cassandra
>
> About compaction overhead, using TwCs with a 1 day bucket and removing
> read repair and subrange repair should be sufficient
>
> Now the only remaining issue is Quorum read which triggers repair
> automagically
>
> Before 4.0  there is no flag to turn it off unfortunately
>
> Le 30 sept. 2019 15:47, "Eric Evans" <jo...@gmail.com> a écrit :
>
> On Sat, Sep 28, 2019 at 8:50 PM Jeff Jirsa <jj...@gmail.com> wrote:
>
> [ ... ]
>
> > 2) The 2TB guidance is old and irrelevant for most people, what you
> really care about is how fast you can replace the failed machine
> >
> > You’d likely be ok going significantly larger than that if you use a few
> vnodes, since that’ll help rebuild faster (you’ll stream from more sources
> on rebuild)
> >
> > If you don’t want to use vnodes, buy big machines and run multiple
> Cassandra instances in it - it’s not hard to run 3-4TB per instance and
> 12-16T of SSD per machine
>
> We do this too.  It's worth keeping in mind though that you'll still
> have a 12-16T blast radius in the event of a host failure.  As the
> host density goes up, consider steps to make the host more robust
> (RAID, redundant power supplies, etc).
>
> --
> Eric Evans
> john.eric.evans@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>
>

Re: Cluster sizing for huge dataset

Posted by DuyHai Doan <do...@gmail.com>.
Thanks all for your replies

The target deployment is on Azure, so with the nice disk snapshot feature,
replacing a dead node is easier: no streaming from Cassandra

About compaction overhead, using TWCS with a 1-day bucket and removing read
repair and subrange repair should be sufficient

Now the only remaining issue is quorum reads, which trigger read repair
automagically

Before 4.0 there is unfortunately no flag to turn it off
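
As an illustration, a table along those lines might look like this (a sketch
using the Python driver; keyspace/table/column names are made up, and the
read_repair_chance knobs only cover the probabilistic read repair, not the
digest-mismatch repair triggered by quorum reads mentioned above):

    # TWCS with 1-day buckets, 2-year TTL, probabilistic read repair disabled (pre-4.0 options)
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE TABLE IF NOT EXISTS iot.sensor_data (
            sensor_id  bigint,
            day        date,
            event_time timestamp,
            value      double,
            PRIMARY KEY ((sensor_id, day), event_time)
        ) WITH compaction = {
              'class': 'TimeWindowCompactionStrategy',
              'compaction_window_unit': 'DAYS',
              'compaction_window_size': '1'
          }
          AND default_time_to_live = 63072000   -- 2 years, matches the retention
          AND read_repair_chance = 0.0
          AND dclocal_read_repair_chance = 0.0
    """)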

On Sep 30, 2019 at 15:47, "Eric Evans" <jo...@gmail.com> wrote:

On Sat, Sep 28, 2019 at 8:50 PM Jeff Jirsa <jj...@gmail.com> wrote:

[ ... ]

> 2) The 2TB guidance is old and irrelevant for most people, what you
really care about is how fast you can replace the failed machine
>
> You’d likely be ok going significantly larger than that if you use a few
vnodes, since that’ll help rebuild faster (you’ll stream from more sources
on rebuild)
>
> If you don’t want to use vnodes, buy big machines and run multiple
Cassandra instances in it - it’s not hard to run 3-4TB per instance and
12-16T of SSD per machine

We do this too.  It's worth keeping in mind though that you'll still
have a 12-16T blast radius in the event of a host failure.  As the
host density goes up, consider steps to make the host more robust
(RAID, redundant power supplies, etc).

-- 
Eric Evans
john.eric.evans@gmail.com


Re: Cluster sizing for huge dataset

Posted by Eric Evans <jo...@gmail.com>.
On Sat, Sep 28, 2019 at 8:50 PM Jeff Jirsa <jj...@gmail.com> wrote:

[ ... ]

> 2) The 2TB guidance is old and irrelevant for most people, what you really care about is how fast you can replace the failed machine
>
> You’d likely be ok going significantly larger than that if you use a few vnodes, since that’ll help rebuild faster (you’ll stream from more sources on rebuild)
>
> If you don’t want to use vnodes, buy big machines and run multiple Cassandra instances in it - it’s not hard to run 3-4TB per instance and 12-16T of SSD per machine

We do this too.  It's worth keeping in mind though that you'll still
have a 12-16T blast radius in the event of a host failure.  As the
host density goes up, consider steps to make the host more robust
(RAID, redundant power supplies, etc).

-- 
Eric Evans
john.eric.evans@gmail.com



Re: Cluster sizing for huge dataset

Posted by Laxmikant Upadhyay <la...@gmail.com>.
I noticed that compaction overhead has not been taken into account in the
capacity planning. I assume that is because the compression in use is
expected to compensate for it. Is my assumption correct?
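
For what it's worth, a very rough illustration of why TWCS plus compression
can keep that overhead small (the 3x compression ratio is purely an
assumption, not a measurement from this workload):

    # Hypothetical per-node headroom at the 10 TB/node density discussed in this thread
    raw_per_node_tb = 10.0
    compression_ratio = 3.0                      # assumed for sensor-like data
    on_disk_tb = raw_per_node_tb / compression_ratio   # ~3.3 TB actually on disk

    # With TWCS, compaction mostly works within the current 1-day window, so the
    # transient extra space is roughly one window's worth of data, not the ~50%
    # free space often budgeted for size-tiered compaction.
    windows = 2 * 365                            # 1-day buckets over 2 years
    headroom_gb = on_disk_tb / windows * 1000
    print(f"on disk ~{on_disk_tb:.1f} TB, transient compaction headroom ~{headroom_gb:.1f} GB")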

On Sun, Sep 29, 2019 at 11:04 PM Jeff Jirsa <jj...@gmail.com> wrote:

>
>
> > On Sep 29, 2019, at 12:30 AM, DuyHai Doan <do...@gmail.com> wrote:
> >
> > Thank you Jeff for the hints
> >
> > We are targeting to reach 20Tb/machine using TWCS and 8 vnodes (using
> > the new token allocation algo). Also we will try the new zstd
> > compression.
>
> I’d provably still be inclined to run two instances per machine for 20TB
> machines unless you’re planning on using 4.0
>
> >
> > About transient replication, the underlying trade-offs and semantics
> > are hard to understand for common people (for example, reading at CL
> > ONE in the face of 2 full replicas loss leads to unavailable
> > exception, unlike normal replication) so we will let it out for the
> > moment
>
> Yea in transient you’d be restoring from backup in this case, but to be
> fair, you’d have violated consistency / lost data written at quorum if two
> replicas fail even without transient replication using RF=3
>
> >
> > Regards
> >
> >> On Sun, Sep 29, 2019 at 3:50 AM Jeff Jirsa <jj...@gmail.com> wrote:
> >>
> >> A few random thoughts here
> >>
> >> 1) 90 nodes / 900T in a cluster isn’t that big. petabyte per cluster is
> a manageable size.
> >>
> >> 2) The 2TB guidance is old and irrelevant for most people, what you
> really care about is how fast you can replace the failed machine
> >>
> >> You’d likely be ok going significantly larger than that if you use a
> few vnodes, since that’ll help rebuild faster (you’ll stream from more
> sources on rebuild)
> >>
> >> If you don’t want to use vnodes, buy big machines and run multiple
> Cassandra instances in it - it’s not hard to run 3-4TB per instance and
> 12-16T of SSD per machine
> >>
> >> 3) Transient replication in 4.0 could potentially be worth trying out,
> depending on your risk tolerance. Doing 2 full and one transient replica
> may save you 30% storage
> >>
> >> 4) Note that you’re not factoring in compression, and some of the
> recent zstd work may go a long way if your sensor data is similar /
> compressible.
> >>
> >>>> On Sep 28, 2019, at 1:23 PM, DuyHai Doan <do...@gmail.com>
> wrote:
> >>>
> >>> Hello users
> >>>
> >>> I'm facing with a very challenging exercise: size a cluster with a
> huge dataset.
> >>>
> >>> Use-case = IoT
> >>>
> >>> Number of sensors: 30 millions
> >>> Frequency of data: every 10 minutes
> >>> Estimate size of a data: 100 bytes (including clustering columns)
> >>> Data retention: 2 years
> >>> Replication factor: 3 (pretty standard)
> >>>
> >>> A very quick math gives me:
> >>>
> >>> 6 data points / hour * 24 * 365 ~50 000 data points/ year/ sensor
> >>>
> >>> In term of size, it is 50 000 x 100 bytes = 5Mb worth of data /year
> /sensor
> >>>
> >>> Now the big problem is that we have 30 millions of sensor so the disk
> >>> requirements adds up pretty fast: 5 Mb * 30 000 000 = 5Tb * 30 = 150Tb
> >>> worth of data/year
> >>>
> >>> We want to store data for 2 years => 300Tb
> >>>
> >>> We have RF=3 ==> 900Tb !!!!
> >>>
> >>> Now, according to commonly recommended density (with SSD), one shall
> >>> not exceed 2Tb of data per node, which give us a rough sizing of 450
> >>> nodes cluster !!!
> >>>
> >>> Even if we push the limit up to 10Tb using TWCS (has anyone tried this
> >>> ?) We would still need 90 beefy nodes to support this.
> >>>
> >>> Any thoughts/ideas to reduce the nodes count or increase density and
> >>> keep the cluster manageable ?
> >>>
> >>> Regards
> >>>
> >>> Duy Hai DOAN
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> >>> For additional commands, e-mail: user-help@cassandra.apache.org
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> >> For additional commands, e-mail: user-help@cassandra.apache.org
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> > For additional commands, e-mail: user-help@cassandra.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>

-- 

regards,
Laxmikant Upadhyay

Re: Cluster sizing for huge dataset

Posted by Jeff Jirsa <jj...@gmail.com>.

> On Sep 29, 2019, at 12:30 AM, DuyHai Doan <do...@gmail.com> wrote:
> 
> Thank you Jeff for the hints
> 
> We are targeting to reach 20Tb/machine using TWCS and 8 vnodes (using
> the new token allocation algo). Also we will try the new zstd
> compression.

I’d probably still be inclined to run two instances per machine for 20TB machines unless you’re planning on using 4.0

> 
> About transient replication, the underlying trade-offs and semantics
> are hard to understand for common people (for example, reading at CL
> ONE in the face of 2 full replicas loss leads to unavailable
> exception, unlike normal replication) so we will let it out for the
> moment

Yea in transient you’d be restoring from backup in this case, but to be fair, you’d have violated consistency / lost data written at quorum if two replicas fail even without transient replication using RF=3

> 
> Regards
> 
>> On Sun, Sep 29, 2019 at 3:50 AM Jeff Jirsa <jj...@gmail.com> wrote:
>> 
>> A few random thoughts here
>> 
>> 1) 90 nodes / 900T in a cluster isn’t that big. petabyte per cluster is a manageable size.
>> 
>> 2) The 2TB guidance is old and irrelevant for most people, what you really care about is how fast you can replace the failed machine
>> 
>> You’d likely be ok going significantly larger than that if you use a few vnodes, since that’ll help rebuild faster (you’ll stream from more sources on rebuild)
>> 
>> If you don’t want to use vnodes, buy big machines and run multiple Cassandra instances in it - it’s not hard to run 3-4TB per instance and 12-16T of SSD per machine
>> 
>> 3) Transient replication in 4.0 could potentially be worth trying out, depending on your risk tolerance. Doing 2 full and one transient replica may save you 30% storage
>> 
>> 4) Note that you’re not factoring in compression, and some of the recent zstd work may go a long way if your sensor data is similar / compressible.
>> 
>>>> On Sep 28, 2019, at 1:23 PM, DuyHai Doan <do...@gmail.com> wrote:
>>> 
>>> Hello users
>>> 
>>> I'm facing with a very challenging exercise: size a cluster with a huge dataset.
>>> 
>>> Use-case = IoT
>>> 
>>> Number of sensors: 30 millions
>>> Frequency of data: every 10 minutes
>>> Estimate size of a data: 100 bytes (including clustering columns)
>>> Data retention: 2 years
>>> Replication factor: 3 (pretty standard)
>>> 
>>> A very quick math gives me:
>>> 
>>> 6 data points / hour * 24 * 365 ~50 000 data points/ year/ sensor
>>> 
>>> In term of size, it is 50 000 x 100 bytes = 5Mb worth of data /year /sensor
>>> 
>>> Now the big problem is that we have 30 millions of sensor so the disk
>>> requirements adds up pretty fast: 5 Mb * 30 000 000 = 5Tb * 30 = 150Tb
>>> worth of data/year
>>> 
>>> We want to store data for 2 years => 300Tb
>>> 
>>> We have RF=3 ==> 900Tb !!!!
>>> 
>>> Now, according to commonly recommended density (with SSD), one shall
>>> not exceed 2Tb of data per node, which give us a rough sizing of 450
>>> nodes cluster !!!
>>> 
>>> Even if we push the limit up to 10Tb using TWCS (has anyone tried this
>>> ?) We would still need 90 beefy nodes to support this.
>>> 
>>> Any thoughts/ideas to reduce the nodes count or increase density and
>>> keep the cluster manageable ?
>>> 
>>> Regards
>>> 
>>> Duy Hai DOAN
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>> For additional commands, e-mail: user-help@cassandra.apache.org
>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>> For additional commands, e-mail: user-help@cassandra.apache.org
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
> 



Re: Cluster sizing for huge dataset

Posted by DuyHai Doan <do...@gmail.com>.
Thank you Jeff for the hints

We are targeting 20 TB/machine using TWCS and 8 vnodes (using the new
token allocation algorithm). We will also try the new zstd compression.
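
For the vnodes part, the relevant cassandra.yaml settings would be along
these lines (a sketch; the keyspace name is a placeholder, and on 4.0 the
allocate_tokens_for_local_replication_factor option can be used instead):

    # cassandra.yaml excerpt (set before bootstrapping the node)
    num_tokens: 8
    # 3.x / DSE token allocation algorithm, driven by an existing keyspace's RF
    allocate_tokens_for_keyspace: iot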

About transient replication, the underlying trade-offs and semantics are
hard to understand for most people (for example, reading at CL ONE after
losing the 2 full replicas leads to an unavailable exception, unlike with
normal replication), so we will leave it out for the moment

Regards

On Sun, Sep 29, 2019 at 3:50 AM Jeff Jirsa <jj...@gmail.com> wrote:
>
> A few random thoughts here
>
> 1) 90 nodes / 900T in a cluster isn’t that big. petabyte per cluster is a manageable size.
>
> 2) The 2TB guidance is old and irrelevant for most people, what you really care about is how fast you can replace the failed machine
>
> You’d likely be ok going significantly larger than that if you use a few vnodes, since that’ll help rebuild faster (you’ll stream from more sources on rebuild)
>
> If you don’t want to use vnodes, buy big machines and run multiple Cassandra instances in it - it’s not hard to run 3-4TB per instance and 12-16T of SSD per machine
>
> 3) Transient replication in 4.0 could potentially be worth trying out, depending on your risk tolerance. Doing 2 full and one transient replica may save you 30% storage
>
> 4) Note that you’re not factoring in compression, and some of the recent zstd work may go a long way if your sensor data is similar / compressible.
>
> > On Sep 28, 2019, at 1:23 PM, DuyHai Doan <do...@gmail.com> wrote:
> >
> > Hello users
> >
> > I'm facing with a very challenging exercise: size a cluster with a huge dataset.
> >
> > Use-case = IoT
> >
> > Number of sensors: 30 millions
> > Frequency of data: every 10 minutes
> > Estimate size of a data: 100 bytes (including clustering columns)
> > Data retention: 2 years
> > Replication factor: 3 (pretty standard)
> >
> > A very quick math gives me:
> >
> > 6 data points / hour * 24 * 365 ~50 000 data points/ year/ sensor
> >
> > In term of size, it is 50 000 x 100 bytes = 5Mb worth of data /year /sensor
> >
> > Now the big problem is that we have 30 millions of sensor so the disk
> > requirements adds up pretty fast: 5 Mb * 30 000 000 = 5Tb * 30 = 150Tb
> > worth of data/year
> >
> > We want to store data for 2 years => 300Tb
> >
> > We have RF=3 ==> 900Tb !!!!
> >
> > Now, according to commonly recommended density (with SSD), one shall
> > not exceed 2Tb of data per node, which give us a rough sizing of 450
> > nodes cluster !!!
> >
> > Even if we push the limit up to 10Tb using TWCS (has anyone tried this
> > ?) We would still need 90 beefy nodes to support this.
> >
> > Any thoughts/ideas to reduce the nodes count or increase density and
> > keep the cluster manageable ?
> >
> > Regards
> >
> > Duy Hai DOAN
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> > For additional commands, e-mail: user-help@cassandra.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>



Re: Cluster sizing for huge dataset

Posted by Jeff Jirsa <jj...@gmail.com>.
A few random thoughts here

1) 90 nodes / 900T in a cluster isn’t that big. A petabyte per cluster is a manageable size.

2) The 2TB guidance is old and irrelevant for most people, what you really care about is how fast you can replace the failed machine

You’d likely be ok going significantly larger than that if you use a few vnodes, since that’ll help rebuild faster (you’ll stream from more sources on rebuild)

If you don’t want to use vnodes, buy big machines and run multiple Cassandra instances in it - it’s not hard to run 3-4TB per instance and 12-16T of SSD per machine 

3) Transient replication in 4.0 could potentially be worth trying out, depending on your risk tolerance. Doing 2 full and one transient replica may save you 30% storage 

4) Note that you’re not factoring in compression, and some of the recent zstd work may go a long way if your sensor data is similar / compressible.
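
For illustration, points 3 and 4 might look roughly like this in CQL on 4.0
(a sketch via the Python driver; keyspace/table names are placeholders, and
transient replication is an experimental feature that also needs
enable_transient_replication: true in cassandra.yaml):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    # 3) two full replicas plus one transient replica ("3/1") instead of three full copies
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS iot
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3/1'}
    """)

    # 4) Zstd SSTable compression, available from 4.0
    session.execute("""
        ALTER TABLE iot.sensor_data
        WITH compression = {'class': 'ZstdCompressor', 'compression_level': 3}
    """)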

> On Sep 28, 2019, at 1:23 PM, DuyHai Doan <do...@gmail.com> wrote:
> 
> Hello users
> 
> I'm facing with a very challenging exercise: size a cluster with a huge dataset.
> 
> Use-case = IoT
> 
> Number of sensors: 30 millions
> Frequency of data: every 10 minutes
> Estimate size of a data: 100 bytes (including clustering columns)
> Data retention: 2 years
> Replication factor: 3 (pretty standard)
> 
> A very quick math gives me:
> 
> 6 data points / hour * 24 * 365 ~50 000 data points/ year/ sensor
> 
> In term of size, it is 50 000 x 100 bytes = 5Mb worth of data /year /sensor
> 
> Now the big problem is that we have 30 millions of sensor so the disk
> requirements adds up pretty fast: 5 Mb * 30 000 000 = 5Tb * 30 = 150Tb
> worth of data/year
> 
> We want to store data for 2 years => 300Tb
> 
> We have RF=3 ==> 900Tb !!!!
> 
> Now, according to commonly recommended density (with SSD), one shall
> not exceed 2Tb of data per node, which give us a rough sizing of 450
> nodes cluster !!!
> 
> Even if we push the limit up to 10Tb using TWCS (has anyone tried this
> ?) We would still need 90 beefy nodes to support this.
> 
> Any thoughts/ideas to reduce the nodes count or increase density and
> keep the cluster manageable ?
> 
> Regards
> 
> Duy Hai DOAN
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
> 
