Posted to user@cassandra.apache.org by Lapo Luchini <la...@lapo.it> on 2021/04/08 13:56:20 UTC

Huge single-node DCs (?)

Hi, one project I wrote uses Cassandra to back the huge amount of 
data it needs (data is written only once and read very rarely, but needs 
to be accessible for years, so the storage needs grow huge over time; 
I chose Cassandra mainly for its horizontal scalability regarding disk 
size), and a client of mine needs to install it on his hosts.

Problem is, while I usually use a cluster of 6 "smallish" nodes (which 
can grow in time), he only has big ESX servers with huge disk space 
(which is already RAID-6 redundant) and couldn't run 3+ nodes per DC.

This is outside my usual experience with Cassandra and, as far as I've 
read around, outside most use cases described on the website or this 
mailing list, so the question is:
does it make sense to use Cassandra with a big (let's say 6TB today, up 
to 20TB in a few years) single-node DataCenter, plus another single-node 
DataCenter (to act as disaster recovery)?
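For concreteness, the two-DC layout just described would be expressed at 
the keyspace level with NetworkTopologyStrategy and one replica per DC. 
A minimal sketch that only builds the CQL statement (the keyspace and DC 
names are hypothetical):

```python
def replication_cql(keyspace, dc_rf):
    """Build a CREATE KEYSPACE statement with NetworkTopologyStrategy
    and a per-datacenter replication factor."""
    dcs = ", ".join(f"'{dc}': {rf}" for dc, rf in sorted(dc_rf.items()))
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {dcs}}};"
    )

# One node per DC: each DC's single replica holds all of the data.
print(replication_cql("archive", {"dc_main": 1, "dc_dr": 1}))
```

Note that with RF=1 per DC there is exactly one local replica, so every 
consistency level reduces to that one node; losing it makes that DC's 
data unavailable until the node (or the DR copy) is brought back.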

Thanks in advance for any suggestion or comment!

-- 
Lapo Luchini
lapo@lapo.it


Re: Huge single-node DCs (?)

Posted by Bowen Song <bo...@bso.ng>.
I'm sure there are lots of pitfalls. A few that come to mind right now:

  * With a single node, you will completely lose the benefit of high
    availability from Cassandra. Not only will hardware failure result
    in downtime; routine maintenance (such as software upgrades) will
    too.
  * RAID6 does provide redundancy in case of a disk failure. However,
    RAID doesn't prevent bit rot, and many implementations (both
    software and hardware) of RAID don't even attempt to detect it. You
    are often at the mercy of the hard drive firmware's ability to
    detect bit rot and return a URE (Unrecoverable Read Error) instead
    of the rotten data. In my experience, even enterprise drives
    don't always do a good job, and SSDs can also fail miserably at
    this. A corrupted SSTable in a single-node cluster could lead to
    permanent data loss, because Cassandra doesn't have a replica of the
    data on other nodes. Recovering the data from a RAID6 is
    theoretically possible, but it almost certainly will cause some
    downtime, and it's not going to be easy.
  * Drives of the same model, from the same batch, installed on the same
    server and used in the same RAID array tend to fail at roughly the
    same time. If you aren't careful enough to mix and match the drives,
    you may end up more than 2 drives failing at roughly the same time
    in your RAID6 and lose your data.
  * Having two very large nodes in a cluster, either within the same DC
    with RF=2 or split into two DCs with RF=1 each, will somewhat help
    to address the above issues, but how long will the repairs take?
  * Depending on the rate of write, you may run into I/O bottlenecks
    because compactions can involve many very large SSTables, and this
    will be made worse by the slow repairs. If the compaction can't keep
    up with the rate of write, your Cassandra node is going to crash
    with the "too many open files" error.
  * For very large nodes, you may also run into memory size constraints.
    See
    https://cassandra.apache.org/doc/latest/operating/compression.html#operational-impact
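On the "how long will the repairs take?" point, a rough back-of-envelope 
sketch; the throughput and efficiency figures below are illustrative 
assumptions, not measurements:

```python
def transfer_hours(data_tb, throughput_mb_s, efficiency=0.5):
    """Estimate hours to stream/validate a node's data at a sustained
    throughput, derated by an efficiency factor for validation,
    compaction and network overhead. All figures are rough."""
    data_mb = data_tb * 1_000_000  # decimal TB -> MB
    return data_mb / (throughput_mb_s * efficiency) / 3600

# Rebuilding a 6 TB node over a link sustaining ~100 MB/s:
print(f"{transfer_hours(6, 100):.1f} h")   # 33.3 h
# The 20 TB case scales proportionally:
print(f"{transfer_hours(20, 100):.1f} h")  # 111.1 h
```

Even under these optimistic assumptions, a full rebuild or repair of a 
dense node is measured in days, during which the bullet points above 
(single replica, compaction backlog) apply.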



Re: Huge single-node DCs (?)

Posted by Jeff Jirsa <jj...@gmail.com>.

> On Apr 9, 2021, at 6:15 AM, Joe Obernberger <jo...@gmail.com> wrote:
> 
> 
> We run a ~1PByte HBase cluster on top of Hadoop/HDFS that works pretty well.  I would love to be able to use Cassandra instead on a system like that.
> 

1PB is definitely in the range of viable Cassandra clusters today.

> Even all SSDs - you can get a system with 24, 2 TByte SSDs, which is too large for 1 instance of Cassandra.  Does 4.x address any of this?
> 

You said it’s too large for one instance ... so run more than one instance? 
> Ebay uses Cassandra and claims to have 80+ petabytes.  What do they do?
> 

Well done / congrats eBay! That’s a nice install. 

(They probably have lots and lots of clusters)

Re: Huge single-node DCs (?)

Posted by Kane Wilson <k...@raft.so>.
4.0 has gone some way toward enabling better densification of nodes, but it
wasn't a main focus. We're probably still only thinking that 4TB - 8TB nodes
will be feasible (and then maybe only for expert users). The main problems
tend to be streaming, compaction, and repairs when it comes to dense nodes.
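To put those per-node figures in context, an illustrative sizing 
calculation; the replication factor, density, and headroom values are 
assumptions for illustration, not recommendations:

```python
import math

def min_nodes(total_data_tb, rf=3, per_node_tb=4.0, headroom=0.5):
    """Nodes needed to hold total_data_tb of raw data replicated rf
    times, filling each node to only `headroom` of its capacity
    (free space is needed for compaction and repairs)."""
    replicated = total_data_tb * rf
    usable_per_node = per_node_tb * headroom
    return math.ceil(replicated / usable_per_node)

# The 20 TB archive from the original question, at 4 TB per node:
print(min_nodes(20))                      # 30
# At the denser 8 TB/node end of the range:
print(min_nodes(20, per_node_tb=8.0))     # 15
```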

> Ebay uses Cassandra and claims to have 80+ petabytes.  What do they do?

They 1. likely have a lot of nodes (1000+ node clusters are possible, just
hard), and 2. that 80 petabytes is undoubtedly spread across many clusters.

raft.so - Cassandra consulting, support, and managed services


On Fri, Apr 9, 2021 at 11:15 PM Joe Obernberger <
joseph.obernberger@gmail.com> wrote:

> We run a ~1PByte HBase cluster on top of Hadoop/HDFS that works pretty
> well.  I would love to be able to use Cassandra instead on a system like
> that.  HBase queries / scans are not the easiest to deal with, but, as with
> Cassandra, if you know the primary key, you can get to your data fast, even
> in trillions of rows.  Cassandra offers some capabilities that HBase
> doesn't that I would like to leverage, but yeah - how can you use Cassandra
> with modern equipment in a bare metal environment?  Kubernetes could make
> sense as long as you're able to maintain data locality with however your
> storage is configured.
> Even all SSDs - you can get a system with 24, 2 TByte SSDs, which is too
> large for 1 instance of Cassandra.  Does 4.x address any of this?
>
> Ebay uses Cassandra and claims to have 80+ petabytes.  What do they do?
>
> -Joe

Re: Huge single-node DCs (?)

Posted by Joe Obernberger <jo...@gmail.com>.
We run a ~1PByte HBase cluster on top of Hadoop/HDFS that works pretty 
well.  I would love to be able to use Cassandra instead on a system 
like that.  HBase queries / scans are not the easiest to deal with, 
but, as with Cassandra, if you know the primary key, you can get to your 
data fast, even in trillions of rows.  Cassandra offers some 
capabilities that HBase doesn't that I would like to leverage, but yeah 
- how can you use Cassandra with modern equipment in a bare metal 
environment?  Kubernetes could make sense as long as you're able to 
maintain data locality with however your storage is configured.
Even all SSDs - you can get a system with 24, 2 TByte SSDs, which is too 
large for 1 instance of Cassandra.  Does 4.x address any of this?

Ebay uses Cassandra and claims to have 80+ petabytes.  What do they do?

-Joe

On 4/8/2021 6:35 PM, Elliott Sims wrote:
> I'm not sure I'd suggest building a single DIY Backblaze pod.  The 
> SATA port multipliers are a pain both from a supply chain and systems 
> management perspective.  Can be worth it when you're amortizing that 
> across a lot of servers and can exert some leverage over wholesale 
> suppliers, but less so for a one-off.  There's a lot more 
> whitebox/OEM/etc options for high-density storage servers these days 
> from Seagate, Dell, HP, Supermicro, etc that are worth a look.
>
>
> I'd agree with this (both examples) sounding like a poor fit for 
> Cassandra.  Seems like you could always just spin up a bunch of 
> Cassandra VMs in the ESX cluster instead of one big one, but something 
> like MySQL or PostgreSQL might suit your needs better.  Or even some 
> sort of flatfile archive with something like Parquet if it's more 
> being kept "just in case" with no need for quick random access.
>
> For the 10PB example, it may be time to look at something like Hadoop, 
> or maybe Ceph.

RE: Huge single-node DCs (?)

Posted by "Durity, Sean R" <SE...@homedepot.com>.
DataStax Enterprise has a new-ish feature set called Big Node that is supposed to help with using much denser nodes. We are going to be doing some testing with that for a similar use case with ever-growing disk needs, but no real increase in read or write volume. At some point it may become available in the open source version, too.


Sean Durity – Staff Systems Engineer, Cassandra

From: Elliott Sims <el...@backblaze.com>
Sent: Thursday, April 8, 2021 6:36 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Huge single-node DCs (?)

I'm not sure I'd suggest building a single DIY Backblaze pod.  The SATA port multipliers are a pain both from a supply chain and systems management perspective.  Can be worth it when you're amortizing that across a lot of servers and can exert some leverage over wholesale suppliers, but less so for a one-off.  There's a lot more whitebox/OEM/etc options for high-density storage servers these days from Seagate, Dell, HP, Supermicro, etc that are worth a look.

I'd agree with this (both examples) sounding like a poor fit for Cassandra.  Seems like you could always just spin up a bunch of Cassandra VMs in the ESX cluster instead of one big one, but something like MySQL or PostgreSQL might suit your needs better.  Or even some sort of flatfile archive with something like Parquet if it's more being kept "just in case" with no need for quick random access.

For the 10PB example, it may be time to look at something like Hadoop, or maybe Ceph.



Re: Huge single-node DCs (?)

Posted by Elliott Sims <el...@backblaze.com>.
I'm not sure I'd suggest building a single DIY Backblaze pod.  The SATA
port multipliers are a pain both from a supply chain and systems management
perspective.  Can be worth it when you're amortizing that across a lot of
servers and can exert some leverage over wholesale suppliers, but less so
for a one-off.  There's a lot more whitebox/OEM/etc options for
high-density storage servers these days from Seagate, Dell, HP, Supermicro,
etc that are worth a look.


I'd agree with this (both examples) sounding like a poor fit for
Cassandra.  Seems like you could always just spin up a bunch of Cassandra
VMs in the ESX cluster instead of one big one, but something like MySQL or
PostgreSQL might suit your needs better.  Or even some sort of flatfile
archive with something like Parquet if it's more being kept "just in case"
with no need for quick random access.

For the 10PB example, it may be time to look at something like Hadoop, or
maybe Ceph.

On Thu, Apr 8, 2021 at 10:39 AM Bowen Song <bo...@bso.ng> wrote:

> This is off-topic. But if your goal is to maximise storage density and
> also ensuring data durability and availability, this is what you should be
> looking at:
>
>    - hardware:
>    https://www.backblaze.com/blog/open-source-data-storage-server/
>    - architecture and software:
>    https://www.backblaze.com/blog/vault-cloud-storage-architecture/

Re: Huge single-node DCs (?)

Posted by Bowen Song <bo...@bso.ng>.
This is off-topic. But if your goal is to maximise storage density and 
also ensuring data durability and availability, this is what you should 
be looking at:

  * hardware:
    https://www.backblaze.com/blog/open-source-data-storage-server/
  * architecture and software:
    https://www.backblaze.com/blog/vault-cloud-storage-architecture/


On 08/04/2021 17:50, Joe Obernberger wrote:
> I am also curious on this question.  Say your use case is to store 
> 10PBytes of data in a new server room / data-center with new 
> equipment, what makes the most sense?  If your database is primarily 
> write with little read, I think you'd want to maximize disk space per 
> rack space.  So you may opt for a 2u server with 24 3.5" disks at 
> 16TBytes each for a node with 384TBytes of disk - so ~27 servers for 
> 10PBytes.
>
> Cassandra doesn't seem to be the good choice for that configuration; 
> the rule of thumb that I'm hearing is ~2Tbytes per node, in which case 
> we'd need over 5000 servers.  This seems really unreasonable.
>
> -Joe

Re: Huge single-node DCs (?)

Posted by Joe Obernberger <jo...@gmail.com>.
I am also curious on this question.  Say your use case is to store 
10PBytes of data in a new server room / data-center with new equipment, 
what makes the most sense?  If your database is primarily write with 
little read, I think you'd want to maximize disk space per rack space.  
So you may opt for a 2u server with 24 3.5" disks at 16TBytes each for a 
node with 384TBytes of disk - so ~27 servers for 10PBytes.

Cassandra doesn't seem to be a good choice for that configuration; the 
rule of thumb that I'm hearing is ~2TBytes per node, in which case we'd 
need over 5000 servers.  This seems really unreasonable.
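The 5000-server figure is straightforward arithmetic; a small sketch 
making it explicit, using the densities mentioned in this thread (no 
replication factor applied, matching the estimate above):

```python
import math

def servers_needed(data_pb, per_node_tb):
    """Servers required to hold data_pb petabytes of raw data at a
    given per-node density (replication not included)."""
    return math.ceil(data_pb * 1_000 / per_node_tb)

print(servers_needed(10, 2))    # 5000 -- the ~2 TB rule of thumb
print(servers_needed(10, 384))  # 27   -- one 24x16TB chassis per node
```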

-Joe
