Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/14 11:55:09 UTC

[GitHub] [druid] mgill25 commented on issue #9343: [Proposal] Pubsub Indexing Service

mgill25 commented on issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343#issuecomment-586257689
 
 
   Hi @jihoonson 
   
   * What semantics are guaranteed by the proposed indexing service? I don't think exactly-once ingestion is possible. And how does the proposed indexing service guarantee it?
   
   We are proposing a 2 step approach:
   
   	1. Build a naive pubsub indexing service that provides all the guarantees a regular pubsub consumer would, that is, at-least-once message semantics. This is in line with how any normal pubsub consumer works.
   
   	2. Do some basic research into how systems such as Dataflow achieve exactly-once processing with pubsub. It is clearly possible to achieve, since Dataflow does it with pubsub (although the details of precisely how are not yet clear to us). This will be more exploratory work.
   
   * Description of the overall algorithm, including what the supervisor and its tasks do, respectively.
   	- The supervisor looks pretty similar to the KafkaStreamSupervisor in its basic functions: creation and management of tasks.
   
   	- If more tasks are required to maintain active task count, it submits new tasks.
   
   	- A single task would do the following basic things:
   
   		- Connect to a pubsub subscription
   		- Pull in a batch from pubsub (relevant tuning parameters should be available in config)
   		- Hand the batch off for persistence
   		- On successful persistence, send an ACK back to pubsub for the batch
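The task loop above can be sketched as follows. `FakeSubscription`, `runTask`, and the in-memory queue are hypothetical stand-ins for the real Pub/Sub client and Druid persistence; only the pull/persist/ACK ordering is the point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a Pub/Sub subscription: messages stay pending
// until ACKed, mirroring at-least-once delivery.
class FakeSubscription {
    private final Queue<String> pending = new ArrayDeque<>();
    final List<String> acked = new ArrayList<>();

    FakeSubscription(List<String> messages) { pending.addAll(messages); }

    // Pull up to maxBatchSize messages (a tuning parameter from task config).
    List<String> pull(int maxBatchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxBatchSize && !pending.isEmpty()) {
            batch.add(pending.poll());
        }
        return batch;
    }

    // ACK a whole batch only after it has been persisted.
    void ack(List<String> batch) { acked.addAll(batch); }

    boolean isEmpty() { return pending.isEmpty(); }
}

public class PubsubTaskSketch {
    // The proposed task loop: repeatedly pull a batch, hand it off for
    // persistence, and ACK only after persistence succeeds.
    static void runTask(FakeSubscription sub, List<String> persisted, int maxBatchSize) {
        while (!sub.isEmpty()) {
            List<String> batch = sub.pull(maxBatchSize);
            persisted.addAll(batch);   // stand-in for Druid segment persistence
            sub.ack(batch);            // ACK sent back to pubsub for the batch
        }
    }

    public static void main(String[] args) {
        FakeSubscription sub = new FakeSubscription(List.of("e1", "e2", "e3", "e4", "e5"));
        List<String> persisted = new ArrayList<>();
        runTask(sub, persisted, 2);
        System.out.println(persisted.size() + " persisted, " + sub.acked.size() + " acked");
        // prints: 5 persisted, 5 acked
    }
}
```

Because the ACK happens strictly after persistence, a crash between the two steps causes redelivery, which is exactly the at-least-once behavior described above.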
   
   * Does the proposed indexing service provide linear scalability? If so, how does it provide it?
   
   	The service can keep launching new tasks to process data from subscriptions as needed. The supervisor can periodically check pubsub metrics and, if the rate of message consumption falls behind the production rate, launch new tasks across the cluster.
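One way the supervisor's periodic check could decide how many tasks to run is sketched below. The metric inputs and `perTaskRate` are illustrative assumptions, not real Druid or Pub/Sub APIs.

```java
public class SupervisorScalingSketch {
    // Decide how many tasks the supervisor should run, given the observed
    // message production rate and the throughput one task can sustain.
    // All names and values here are illustrative assumptions.
    static int desiredTaskCount(double productionRate, double perTaskRate,
                                int currentTasks, int maxTasks) {
        if (perTaskRate <= 0) return currentTasks;
        int needed = (int) Math.ceil(productionRate / perTaskRate);
        // Never scale below one task; respect the configured ceiling.
        return Math.max(1, Math.min(needed, maxTasks));
    }

    public static void main(String[] args) {
        // Producers emit ~950 msg/s, each task handles ~300 msg/s.
        System.out.println(desiredTaskCount(950.0, 300.0, 2, 10)); // prints 4
    }
}
```

If the returned count exceeds the current active task count, the supervisor submits new tasks, matching the "maintain active task count" behavior described above.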
   
   * How does it handle transient failures such as task failures?
   	- If a task fails before a successful ACK has been sent out, the batch should be reprocessed.
   	- If data is successfully persisted but ACK delivery fails, we would want to introduce a retry policy.
   	- In case of permanent task failure, pubsub would redeliver the messages, which is in line with the at-least-once guarantee of the indexing service.
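The ACK retry policy for the second case could look like this sketch: bounded retries with exponential backoff, falling back to Pub/Sub redelivery when all attempts fail. The `ackCall` supplier is a hypothetical stand-in for the real client call.

```java
import java.util.function.BooleanSupplier;

public class AckRetrySketch {
    // Try to deliver an ACK up to maxAttempts times with exponential backoff.
    // Returns true on success; false means we give up and rely on Pub/Sub
    // redelivery (the at-least-once path). ackCall is a hypothetical stand-in.
    static boolean ackWithRetry(BooleanSupplier ackCall, int maxAttempts,
                                long initialBackoffMillis) throws InterruptedException {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (ackCall.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoff);
                backoff *= 2; // exponential backoff between attempts
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated ACK endpoint that fails twice, then succeeds.
        int[] calls = {0};
        boolean ok = ackWithRetry(() -> ++calls[0] >= 3, 5, 10);
        System.out.println(ok + " after " + calls[0] + " attempts");
        // prints: true after 3 attempts
    }
}
```

Note that a failed ACK after successful persistence means the batch will be redelivered and persisted again, which is why this design is at-least-once rather than exactly-once.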
   
   * Exactly-once case: I think it's fair to say we don't yet have a fully clear understanding of how to make exactly-once work, but we know other systems do claim to provide those guarantees. I'm interested in seeing whether we can achieve the same with Druid, but for that to happen, the foundation described above needs to be built first, IMHO.
   
   There are unanswered questions here that we haven't fleshed out yet. Would be happy to brainstorm. :)
   
