Posted to dev@druid.apache.org by GitBox <gi...@apache.org> on 2018/07/23 18:46:24 UTC

[GitHub] jihoonson commented on issue #6036: use S3 as a backup storage for hdfs deep storage

URL: https://github.com/apache/incubator-druid/pull/6036#issuecomment-407161433

Hi @gaodayue, thanks for the PR. I have a question.

> In many organization, Hadoop and HDFS are typically used in offline data analysis, while Druid is targeting online data serving. Thus SLA provided by HDFS often can't meet the needs of Druid.

- I think that, if this is the case, you might need to either somehow increase the write throughput of your HDFS cluster or use a separate deep storage. If the first option is not available to you, would it make sense to use only S3 as your deep storage?

As for the approach in this PR, I'm not sure it is a good idea. Maybe we should define the concept of a backup deep storage for all deep storage types and support it generally. Maybe the primary and backup deep storages should also be kept in sync automatically.

However, this PR supports a backup only for HDFS deep storage, and it appears to require another tool, `restore-hdfs-segment`, to make sure all segments eventually reside in HDFS. That adds extra operational steps, which makes Druid harder to operate.
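
To make the generic alternative more concrete, here is a rough sketch of a backup deep storage that works for any storage type and syncs itself, so no separate restore tool is needed. Everything in it is hypothetical: `InMemoryStorage`, `BackedUpStorage`, `push`, `list_segments`, and `sync` are illustration names, not Druid's actual interfaces, and a real version would have to deal with load specs and metadata updates as well.

```python
from typing import Dict


class InMemoryStorage:
    """Hypothetical stand-in for any deep storage type (HDFS, S3, ...)."""

    def __init__(self) -> None:
        self.segments: Dict[str, bytes] = {}
        self.available = True  # flip to False to simulate an outage

    def push(self, segment_id: str, data: bytes) -> None:
        if not self.available:
            raise IOError("storage unavailable")
        self.segments[segment_id] = data

    def list_segments(self) -> Dict[str, bytes]:
        if not self.available:
            raise IOError("storage unavailable")
        return dict(self.segments)


class BackedUpStorage:
    """Hypothetical wrapper: write to primary, fall back to backup on failure."""

    def __init__(self, primary: InMemoryStorage, backup: InMemoryStorage) -> None:
        self.primary = primary
        self.backup = backup

    def push(self, segment_id: str, data: bytes) -> str:
        """Push a segment and report which storage accepted it."""
        try:
            self.primary.push(segment_id, data)
            return "primary"
        except IOError:
            self.backup.push(segment_id, data)
            return "backup"

    def sync(self) -> int:
        """Copy segments that exist only in backup into primary.

        Returns the number of segments copied. In this sketch it would be
        called periodically instead of running a separate restore tool.
        """
        primary_ids = set(self.primary.list_segments())
        copied = 0
        for segment_id, data in self.backup.list_segments().items():
            if segment_id not in primary_ids:
                self.primary.push(segment_id, data)
                copied += 1
        return copied
```

Because the wrapper only relies on `push` and `list_segments`, the same logic would apply to any pair of storage types rather than to HDFS alone.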

Side comment: the Kafka indexing service guarantees exactly-once ingestion, so data loss is not expected to happen. If deep storage is unavailable, all attempts to publish segments fail, and every task whose publish failed should restart from the same offset. This means reprocessing the same data, which can slow ingestion down, but there should be no data loss or data duplication.
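
The restart behavior can be illustrated with a small toy model. It assumes that a segment and the stream offset it ends at are committed together in one step, so a failed publish leaves the committed offset unchanged. `MetadataStore`, `run_task`, and the `deep_storage_up` flag are made-up names for this sketch, not Druid code.

```python
from typing import List


class MetadataStore:
    """Toy model: published segments and the committed offset change together."""

    def __init__(self) -> None:
        self.committed_offset = 0
        self.segments: List[List[str]] = []

    def publish(self, start: int, end: int, rows: List[str]) -> bool:
        # Accept the publish only if the task started exactly where the
        # previous successful publish ended; this is what rules out duplicates.
        if start != self.committed_offset:
            return False
        self.segments.append(list(rows))
        self.committed_offset = end
        return True


def run_task(stream: List[str], store: MetadataStore,
             deep_storage_up: bool, batch_size: int = 2) -> bool:
    """Read one batch from the committed offset and try to publish it."""
    start = store.committed_offset
    end = min(start + batch_size, len(stream))
    rows = stream[start:end]
    if not deep_storage_up:
        # The segment push failed, so nothing is committed. The next task
        # reads the same committed offset and reprocesses the same rows.
        return False
    return store.publish(start, end, rows)
```

Running a task while deep storage is down and then again once it is back reprocesses the same batch, but each row ends up in exactly one published segment.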
