Posted to dev@tubemq.apache.org by Goson zhang <go...@apache.org> on 2020/12/08 02:30:33 UTC

[TOPIC] How reliable is the TubeMQ system?

Yesterday a business team asked us how reliable the TubeMQ system is.

As I understand it, system reliability has two parts: the reliability of
the system services, and the reliability of the data.

System reliability:
1.1 As long as any one of the Brokers assigned to a topic in the cluster
is alive, the production and consumption services of that topic are
available;
1.2 Following from 1.1, as long as every topic in the cluster has at least
one Broker alive, production and consumption are available for all topics
in the cluster;
1.3 Even if all of the Master control nodes go down, only new production
and consumption in the cluster is affected; producers and consumers that
have already registered keep producing and consuming.
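
To make points 1.1 to 1.3 concrete, here is a minimal Python sketch. It only
illustrates the rules as described above and is not TubeMQ code: the
function names, the Broker IDs and the topic-to-Broker mapping are all
hypothetical.

```python
from typing import Dict, Set


def topic_available(assigned: Set[str], alive: Set[str]) -> bool:
    """Point 1.1: a topic is usable while any one of its Brokers is alive."""
    return bool(assigned & alive)


def cluster_available(topic_brokers: Dict[str, Set[str]], alive: Set[str]) -> bool:
    """Point 1.2: every topic must still have at least one live Broker."""
    return all(topic_available(b, alive) for b in topic_brokers.values())


def can_serve(already_registered: bool, master_alive: bool, topic_ok: bool) -> bool:
    """Point 1.3: with all Masters down, only registered clients keep working."""
    return topic_ok and (already_registered or master_alive)


# Hypothetical layout: two topics spread over three Brokers.
topic_brokers = {"orders": {"b1", "b2"}, "logs": {"b2", "b3"}}

print(cluster_available(topic_brokers, alive={"b2"}))  # both topics keep b2
print(cluster_available(topic_brokers, alive={"b1"}))  # "logs" has no Broker
print(can_serve(already_registered=False, master_alive=False, topic_ok=True))
```

The point of the sketch is that availability is decided per topic, so losing
Brokers only matters once some topic has lost all of the Brokers assigned to it.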

Data reliability:
In the TubeMQ system, data is stored on a single node, with RAID10 on the
disks providing the copy. Data may be lost only under the following
conditions:
2.1 When the machine loses power, data that has been acknowledged as
successful but is still in memory and not yet consumed will be lost; once
the machine is back online, data already stored on disk is not affected;
2.2 When a disk failure goes beyond what RAID10 can tolerate, data that has
been acknowledged as successful but not yet consumed is affected; after the
disk is repaired, stored data that could not be recovered remains affected;
2.3 For ordinary day-to-day bad disks, production and consumption on the
Broker node with the bad disk are not affected.

As for quantitative reliability indicators, I personally feel they are not
easy to evaluate, because they depend on the hardware and the IDC
environment. But here is one real deployment for reference: according to
the 2019 statistics from our environment, out of the 1500 machines in the
entire TubeMQ cluster, about 40 over the whole year had ping abnormalities
or disk group damage that RAID10 could not tolerate. Note also that the
machines we use are second-hand equipment, retired after several years of
business use.
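
For a rough sense of scale, the 2019 figures above can be turned into an
annual per-machine rate. This is a back-of-the-envelope sketch in Python,
not an official metric: it assumes the roughly 40 affected machines are out
of the 1500 in the cluster, and the hypothetical helper p_any_event
additionally assumes that machines fail independently, which hardware
sharing one IDC may not.

```python
# Figures quoted in the post (2019, one environment only).
MACHINES = 1500
AFFECTED_PER_YEAR = 40  # ping abnormalities or disk damage beyond RAID10

# Share of machines hit by such an event during the year.
annual_rate = AFFECTED_PER_YEAR / MACHINES


def p_any_event(n_brokers: int, rate: float = annual_rate) -> float:
    """Chance that at least one of n_brokers sees such an event in a year.

    Hypothetical helper; assumes independent, equally likely events.
    """
    return 1.0 - (1.0 - rate) ** n_brokers


print(f"annual per-machine rate: {annual_rate:.2%}")  # 2.67%
print(f"at least one event among 10 Brokers: {p_any_event(10):.1%}")
```

These numbers describe one environment running retired second-hand
machines, so they are a reference point rather than a guarantee.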

Thanks

Re: [TOPIC] How reliable is the TubeMQ system?

Posted by kaynewu <ka...@apache.org>.

Sounds good!
+1