Posted to dev@pulsar.apache.org by Apache Pulsar Slack <ap...@gmail.com> on 2020/01/27 09:11:02 UTC

Slack digest for #dev - 2020-01-27

2020-01-26 11:42:32 UTC - Rodrigo Batista: @Rodrigo Batista has joined the channel
----
2020-01-26 11:49:25 UTC - Nikita Mathur: @Nikita Mathur has joined the channel
----
2020-01-27 00:24:14 UTC - Eugen: Redundant Pulsar Producers
----
2020-01-27 00:24:28 UTC - Eugen: There is a use case, not uncommon in the financial world, that I think is not satisfactorily handled by Pulsar yet. In this scenario, there is a single high-throughput source of data which, for reasons of redundancy, sends two copies of the exact same messages via 2 different routes, in our case two different (sets of) UDP multicast groups. The messages on the different groups are identical and share a contiguous sequence id. That sequence id is reset every day before the stock exchange's market open.

The stock exchange client can receive the data completely redundantly (different network routers/switches, dedicated lines to the exchange, etc.), and under normal circumstances it basically discards half of the data received, as it is redundant. But as this is UDP, packets may arrive out of order, and there may be the occasional loss of a packet on either or even both streams. It would be great if we could have two redundant Pulsar producers that just feed the messages from the exchange into the Pulsar cluster, and Pulsar would take care of deduplication and ordering.
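To illustrate, here is a minimal sketch of the dedup/reorder logic described above (all names are mine and hypothetical; `downstream` would be, e.g., a Pulsar producer's send):

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.Consumer;

// Arbitrates two redundant feeds into one gap-free, ordered stream:
// each SeqId is emitted exactly once; out-of-order arrivals are buffered.
final class FeedArbiter {
    private final NavigableMap<Long, byte[]> pending = new TreeMap<>();
    private final Consumer<byte[]> downstream; // e.g. a Pulsar producer's send()
    private long nextExpected;

    FeedArbiter(long firstSeqId, Consumer<byte[]> downstream) {
        this.nextExpected = firstSeqId;
        this.downstream = downstream;
    }

    /** Called for every packet received on either feed A or feed B. */
    synchronized void onMessage(long seqId, byte[] payload) {
        if (seqId < nextExpected || pending.containsKey(seqId)) {
            return; // duplicate from the redundant feed, or already emitted -- drop
        }
        pending.put(seqId, payload); // buffer until any gap before it is filled
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            downstream.accept(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
    }
}
```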

There are, as far as I can tell, 3 (in fact completely orthogonal) features missing in Pulsar for this to work (see the hypothetical sketch after this list):
1. a "deduplicate across producers" option, which would make sure that the last seen SeqId is stored and handled per topic, not per producer
2. a deduplication option to sort incoming messages with out-of-order SeqIds
3. an admin (?) function to reset `HighestSequenceId`
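Purely as an illustration, the three options might look like this (none of these calls exist in Pulsar today; `deduplicationScope`, `sequenceReorderWindow`, and `resetHighestSequenceId` are all made up):

```java
// HYPOTHETICAL -- sketch of the proposed API surface, not real Pulsar code.
Producer<byte[]> producer = client.newProducer()
        .topic("persistent://exchange/marketdata/feed")
        .deduplicationScope(DeduplicationScope.TOPIC) // (1) track last SeqId per topic
        .sequenceReorderWindow(10_000)                // (2) buffer & sort out-of-order SeqIds
        .create();

// (3) hypothetical admin call, run before each market open:
admin.topics().resetHighestSequenceId("persistent://exchange/marketdata/feed");
```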

So I would like to hear Pulsar developers' opinions about this, and if this is perhaps worth a PIP.

N.B. I have implemented the above ordering and deduplication logic twice before, so I'm familiar with the task; however, I'm still pretty much unfamiliar with Pulsar, as I've just started evaluating whether Pulsar is a good fit for us.
----
2020-01-27 05:55:05 UTC - Antti Kaikkonen: @Antti Kaikkonen has joined the channel
----
2020-01-27 07:21:02 UTC - Vladimir Shchur: Can you use an intermediate topic? All messages would first go there without deduplication, and then a third producer would consume that topic and republish to one with deduplication.
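A minimal sketch of that relay, assuming made-up topic names, that the upstream producers carry the exchange's SeqId as the message's sequence id, and that deduplication has been enabled on the target namespace (e.g. via `pulsar-admin namespaces set-deduplication exchange/marketdata --enable`):

```java
import org.apache.pulsar.client.api.*;

public class DedupRelay {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Both redundant producers write everything here, duplicates included.
        Consumer<byte[]> raw = client.newConsumer()
                .topic("persistent://exchange/marketdata/raw")
                .subscriptionName("dedup-relay")
                .subscribe();

        // Dedup-enabled target topic; a fixed producer name lets the broker
        // track the highest SeqId for this producer across restarts.
        Producer<byte[]> deduped = client.newProducer()
                .topic("persistent://exchange/marketdata/clean")
                .producerName("relay")
                .create();

        while (true) {
            Message<byte[]> msg = raw.receive();
            // Reuse the exchange's SeqId so the broker drops redundant copies.
            deduped.newMessage()
                    .sequenceId(msg.getSequenceId())
                    .value(msg.getValue())
                    .send();
            raw.acknowledge(msg);
        }
    }
}
```
Note that broker deduplication only drops messages whose SeqId is not greater than the highest one already seen, so out-of-order SeqIds would be silently discarded rather than reordered; the relay would still need something like the arbiter sketched above (Eugen's point 2).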
----
2020-01-27 07:48:47 UTC - Eugen: That was my _Plan B_. Plan B, because it would be much more resource-intensive than deduplicating in the outermost ingestion step. A single one of my input streams averages several tens of thousands of msgs/sec, with peaks of 250k msgs/sec. If we don't handle 2 such streams at the ingestion step, the indirection of Plan B would more than double the consumption of both bookie disk and broker CPU.
----