Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/10 08:45:55 UTC

[GitHub] [druid] aditya-r-m opened a new issue #9343: [Proposal] Pubsub Indexing Service

aditya-r-m opened a new issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343
 
 
   ### Motivation
   
    For streaming ingestion, Kafka and Kinesis are currently the two primary options for Druid.
    This proposal is to write an extension that allows data ingestion from Google Cloud Pub/Sub.
   
   ### Proposed changes
   
    The proposed extension will work in a manner very similar to the Seekable Stream Supervisors & Tasks that we currently have, and will be a simplified version of them in most respects.
   
    Key differences between Pub/Sub & Kafka queues in the context of this implementation are as follows:
    1. Unlike Kafka, Pub/Sub has no concept of ordered logs. Packets are pulled in batches from the cloud, and an acknowledgement is sent per packet after it is successfully processed.
    2. While Kafka has a Topic/Partition hierarchy such that a packet for a topic goes to only one of its partitions, Pub/Sub has a Topic/Subscription hierarchy where any packet sent to a topic is replicated to all of its subscriptions. In the ideal case, each packet should be pulled only once from a given subscription.
   
   One key design decision that we are suggesting is to have a completely independent extension that does not share any logic with the Seekable Stream Ingestion extensions.
   
    1. The ingestion specs will be very similar to the Kafka ingestion specs, with configuration options that mostly overlap, plus some additions & removals.
    2. The structure & patterns in the code will be a simplified merger of SeekableStreamIndexingService & KafkaIndexingService.
   
    One key challenge to note is that Pub/Sub does not provide exactly-once semantics the way Kafka does.
    This means that under high lag or failures, consumers may pull the same packet from the subscription more than once.
    One reasonable approach to tackle this is best-effort deduplication.
    There are techniques to minimize duplication, a few of which are explained below:
   
    - Ack deadline tweaks: A consumer can extend ack deadlines when some packets take longer to process than expected. Choosing a sensible ack deadline configuration also has a large impact to start with.
    - Configurable LRU cache / Bloom filter based dedup: A local sketch of the unique packets encountered in the previous 'x' minutes can prevent duplicate insertion on the basis of unique packet IDs.
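    As a rough illustration, the LRU-cache variant of this idea could look like the sketch below. The class name `MessageIdDeduper`, its entry-count bound, and the access-ordered eviction are illustrative assumptions, not a proposed API; a real version might bound by time window rather than entry count, or use a Bloom filter instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Best-effort dedup keyed on unique packet IDs. Hypothetical sketch:
// bounds memory by entry count; a production version could bound by time.
class MessageIdDeduper
{
  private final Map<String, Boolean> seen;

  MessageIdDeduper(final int maxEntries)
  {
    // Access-ordered LinkedHashMap: once the cache exceeds maxEntries,
    // the least recently seen ID is evicted.
    this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true)
    {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest)
      {
        return size() > maxEntries;
      }
    };
  }

  /** Returns true if this ID has not been seen recently, i.e. the packet should be ingested. */
  synchronized boolean markAndCheck(final String messageId)
  {
    return seen.put(messageId, Boolean.TRUE) == null;
  }
}
```

    Note that this stays best-effort: an ID evicted from the cache can be ingested twice, which is acceptable under at-least-once semantics.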
   
    It may be possible to provide perfect deduplication using a shared key-value store, but that is out of scope for the first version of the extension.
   
   ### Rationale
   
   A discussion of why this particular solution is the best one. One good way to approach this is to discuss other alternative solutions that you considered and decided against. This should also include a discussion of any specific benefits or drawbacks you are aware of.
   
   ### Operational impact
   
    Since the extension is a completely new feature with separate code paths, it has no operational impact on running Druid clusters.
    The dependencies being updated globally are the Guava & Guice versions, which are very outdated & do not work with the latest Pub/Sub client libraries. Any regression from this should be caught by automated tests.
   
   
   ### Test plan
   TODO
   
   ### Future work
   TODO
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] mgill25 edited a comment on issue #9343: [Proposal] Pubsub Indexing Service

Posted by GitBox <gi...@apache.org>.
mgill25 edited a comment on issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343#issuecomment-586257689
 
 
   Hi @jihoonson 
   
    * What semantics are guaranteed by the proposed indexing service? I don't think exactly-once ingestion is possible. And how does the proposed indexing service guarantee its semantics?
   
    We are proposing a 2-step approach:
    
    - Build a naive Pub/Sub indexing service which provides all the guarantees that a regular Pub/Sub consumer does, that is, at-least-once message semantics. This is in line with how any normal Pub/Sub consumer works.
    
    - Do some basic research into how systems such as Dataflow achieve exactly-once processing with Pub/Sub. It is clearly possible to achieve this, since Dataflow does it with Pub/Sub (although the details of precisely how are not yet clear to us). This will be more exploratory work.
   
    * Description of the overall algorithm, including what the supervisor and its tasks do, respectively.
    	- The supervisor looks pretty similar to the Kafka supervisor in its basic functions: creation and management of tasks.
    
    	- If more tasks are required to maintain the active task count, it submits new tasks.
    
    	- A single task does the following basic things:
    
    		- Connects to a Pub/Sub subscription
    		- Pulls in a batch from Pub/Sub (relevant tuning parameters should be available in config)
    		- Hands the packets off for persistence
    		- On successful persistence, sends an ACK back to Pub/Sub for the batch
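    The task loop above can be sketched roughly as follows. Both interfaces (`SubscriptionClient`, `RowPersister`) are hypothetical stand-ins for the real Pub/Sub client and Druid's persistence machinery; the point is only the ordering: persist first, ack only afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of one pull/persist/ack cycle for a task. All types here are
// illustrative assumptions, not actual Druid or Pub/Sub client classes.
class PubsubTaskRunner
{
  static class PulledMessage
  {
    final String ackId;
    final byte[] payload;

    PulledMessage(String ackId, byte[] payload)
    {
      this.ackId = ackId;
      this.payload = payload;
    }
  }

  interface SubscriptionClient
  {
    List<PulledMessage> pull(int maxMessages);

    void acknowledge(List<String> ackIds);
  }

  interface RowPersister
  {
    // Throws on failure, in which case no ACK is sent and Pub/Sub redelivers.
    void persist(List<byte[]> rows);
  }

  private final SubscriptionClient client;
  private final RowPersister persister;
  private final int batchSize;

  PubsubTaskRunner(SubscriptionClient client, RowPersister persister, int batchSize)
  {
    this.client = client;
    this.persister = persister;
    this.batchSize = batchSize;
  }

  /** Runs one pull-persist-ack cycle and returns the number of messages acked. */
  int runOnce()
  {
    List<PulledMessage> batch = client.pull(batchSize);
    if (batch.isEmpty()) {
      return 0;
    }
    List<byte[]> rows = new ArrayList<>();
    List<String> ackIds = new ArrayList<>();
    for (PulledMessage m : batch) {
      rows.add(m.payload);
      ackIds.add(m.ackId);
    }
    // Persist before acking: a crash here means redelivery, not data loss.
    persister.persist(rows);
    client.acknowledge(ackIds);
    return ackIds.size();
  }
}
```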
   
    * Does the proposed indexing service provide linear scalability? If so, how?
    
    	The service can keep launching new tasks to process data from subscriptions as needed. The supervisor can periodically check Pub/Sub metrics, and if the rate of message consumption is falling behind the production rate, it can launch new tasks across the cluster.
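    That periodic check could reduce to a decision function like the sketch below. The function, its rate-based inputs, and the never-scale-down behaviour are all assumptions for illustration; real signals would come from Pub/Sub monitoring metrics (e.g. backlog size) rather than raw rates.

```java
// Illustrative scaling check for the supervisor: compare the publish rate to
// the aggregate consume rate and grow the task count if ingestion lags.
// Hypothetical helper, not a real Druid or Pub/Sub API.
class ScalingPolicy
{
  static int desiredTaskCount(
      double publishRatePerSec,
      double consumeRatePerTaskPerSec,
      int currentTasks,
      int maxTasks
  )
  {
    if (consumeRatePerTaskPerSec <= 0) {
      return currentTasks; // no signal; keep the current count
    }
    int needed = (int) Math.ceil(publishRatePerSec / consumeRatePerTaskPerSec);
    // This sketch only scales up; scaling down safely would also need
    // idle-task detection and graceful shutdown.
    return Math.min(maxTasks, Math.max(currentTasks, needed));
  }
}
```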
   
    * How does it handle transient failures such as task failures?
    	- If a task fails before a successful ACK has been sent, the batch is redelivered and reprocessed.
    	- If data is persisted successfully but ACK delivery fails, we would introduce a retry policy.
    	- In case of permanent failure, Pub/Sub redelivers the message, which is in line with the at-least-once guarantee of the indexing service.
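    The retry policy mentioned above (for the persisted-but-ACK-failed case) might look like the bounded exponential backoff below. Attempt counts and backoff values are illustrative; after the final failure we simply fall back on Pub/Sub redelivery.

```java
// Bounded retry with exponential backoff for ACK delivery. Hypothetical
// sketch: on final failure we give up and let Pub/Sub redeliver, which
// keeps the at-least-once guarantee (at the cost of a duplicate).
class AckRetry
{
  interface AckCall
  {
    void send() throws Exception;
  }

  /** Returns true if the ACK succeeded within maxAttempts. */
  static boolean withRetry(AckCall call, int maxAttempts, long initialBackoffMs)
  {
    long backoffMs = initialBackoffMs;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        call.send();
        return true;
      }
      catch (Exception e) {
        if (attempt == maxAttempts) {
          return false;
        }
        try {
          Thread.sleep(backoffMs);
        }
        catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return false;
        }
        backoffMs *= 2;
      }
    }
    return false;
  }
}
```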
   
    * Exactly-once case: I think it's fair to say we don't yet have a clear understanding of how to make exactly-once work, but we know other systems do claim to provide those guarantees. I'm interested in seeing whether we can achieve the same with Druid, but for that to happen, the foundation described above needs to be built first, IMHO.
   
   There are unanswered questions here that we haven't fleshed out yet. Would be happy to brainstorm. :)
   




[GitHub] [druid] aditya-r-m commented on issue #9343: [Proposal] Pubsub Indexing Service

Posted by GitBox <gi...@apache.org>.
aditya-r-m commented on issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343#issuecomment-584023622
 
 
   cc @jihoonson @mgill25



[GitHub] [druid] aditya-r-m commented on issue #9343: [Proposal] Pubsub Indexing Service

Posted by GitBox <gi...@apache.org>.
aditya-r-m commented on issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343#issuecomment-590179700
 
 
    Hi @jihoonson, I have updated the description with the additional details that @mgill25 mentioned above.
    Please review whether anything else should be added or updated.
    We can add more details as we move forward with implementation & testing.




[GitHub] [druid] jihoonson commented on issue #9343: [Proposal] Pubsub Indexing Service

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #9343: [Proposal] Pubsub Indexing Service
URL: https://github.com/apache/druid/issues/9343#issuecomment-585012920
 
 
   Hi @aditya-r-m, thank you for the proposal!
   
   > The proposed extension will work in a manner very similar to Seekable Stream Supervisors & Tasks that we currently have & will be a simplified version of the same in most cases.
   
    I think this explanation is quite vague, and I'm not sure how the Pubsub indexing service would work, especially given how different the Pub/Sub architecture is from Kafka/Kinesis. Would you please add more details, including the below?
   
    - What semantics are guaranteed by the proposed indexing service? I don't think exactly-once ingestion is possible. And how does the proposed indexing service guarantee its semantics?
    - A description of the overall algorithm, including what the supervisor and its tasks do, respectively.
    - Does the proposed indexing service provide linear scalability? If so, how?
    - How does it handle transient failures such as task failures?


