Posted to dev@druid.apache.org by GitBox <gi...@apache.org> on 2018/07/24 05:47:26 UTC

[GitHub] gaodayue commented on issue #6036: use S3 as a backup storage for hdfs deep storage

URL: https://github.com/apache/incubator-druid/pull/6036#issuecomment-407288499
 
 
   Hi @jihoonson, thanks for your comments. Answering your questions below.
   
   > I think, if this is the case, you might need to somehow increase write throughput of your HDFS or use a separate deep storage. If the first option is not available for you, does it make sense to use only S3 as your deep storage?
   
    Our company operates its own large Hadoop cluster (>5k nodes) for us to use. Switching to s3-deep-storage would incur extra cost and is not an option for us.
   
   > Maybe we need to define the concept of backup deep storage for all deep storage types and support it. 
   
    At first I thought about implementing something like a composite-deep-storage that could add backup capability to all deep storage types, but I found it non-trivial to load multiple deep storage extensions inside composite-deep-storage. So I decided to add the support to hdfs-deep-storage only, simply because that's what we're using.
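    To illustrate the idea (this is a minimal sketch, not the actual code in this PR), the backup behavior is essentially a fallback around the segment push: try HDFS first, and only push to S3 when HDFS fails. The SegmentPusher interface and class names below are hypothetical placeholders for illustration, not Druid's real extension API.

```java
// Minimal sketch of the fallback idea: try the primary (HDFS) pusher first and
// fall back to the backup (S3) pusher only when the primary push fails.
// SegmentPusher and BackupAwareSegmentPusher are hypothetical placeholder
// types for illustration; they are not Druid's actual classes.
import java.io.File;
import java.io.IOException;

interface SegmentPusher {
  /** Pushes a local segment file to deep storage and returns its storage path. */
  String push(File segmentFile) throws IOException;
}

class BackupAwareSegmentPusher implements SegmentPusher {
  private final SegmentPusher primary; // e.g. HDFS
  private final SegmentPusher backup;  // e.g. S3

  BackupAwareSegmentPusher(SegmentPusher primary, SegmentPusher backup) {
    this.primary = primary;
    this.backup = backup;
  }

  @Override
  public String push(File segmentFile) throws IOException {
    try {
      return primary.push(segmentFile);
    } catch (IOException e) {
      // HDFS is unavailable: push to the backup storage so the task can still
      // publish the segment instead of failing and losing data.
      return backup.push(segmentFile);
    }
  }
}
```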
   
   > Maybe the primary deep storage and backup deep storage should be in sync automatically.
   
    What do you mean by "in sync"? Do you mean that all segments pushed to the backup storage should eventually be copied back to the primary storage? If that's the case, I don't think there is a strong need for it (explained below).
   
   > But, this PR is restricted to support it for only HDFS deep storage and looks to require another tool, called restore-hdfs-segment, to keep all segments to reside in HDFS. This would need additional operations which make Druid operation difficult.
   
    First, the restore-hdfs-segment tool is not required to achieve HDFS fault tolerance; I developed it for other reasons. One is to pay less for S3, and the other is that we occasionally need to migrate a datasource from one cluster to another: we want all segments to reside on HDFS so that we can simply use the insert-segment-to-db tool to migrate them. Users who don't share these concerns can simply ignore restore-hdfs-segment.
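    Conceptually, the restore pass is just a copy from the backup location back to HDFS, plus a metadata update so segments point at HDFS again. Here is a rough sketch of that loop, assuming hypothetical helpers (listBackedUpSegments, copyBackToHdfs, updateLoadSpec) that stand in for the real tool's logic:

```java
// Rough conceptual sketch of a restore pass. The helper methods here are
// hypothetical placeholders, not methods of the actual restore-hdfs-segment tool.
import java.io.IOException;
import java.net.URI;
import java.util.Collections;
import java.util.List;

class RestoreFromBackupSketch {
  void run() throws IOException {
    for (URI s3Location : listBackedUpSegments()) {
      URI hdfsLocation = copyBackToHdfs(s3Location); // stream the bytes from S3 into HDFS
      updateLoadSpec(s3Location, hdfsLocation);      // segment metadata now points at HDFS
    }
  }

  List<URI> listBackedUpSegments() {
    // ... find segments whose load spec still points at the backup (S3) location ...
    return Collections.emptyList();
  }

  URI copyBackToHdfs(URI s3Location) throws IOException {
    // ... copy the segment object from S3 into the expected HDFS segment path ...
    return URI.create("hdfs://placeholder/segment/path"); // placeholder result
  }

  void updateLoadSpec(URI from, URI to) {
    // ... rewrite the segment's load spec in the metadata store to the HDFS path ...
  }
}
```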
   
    Second, concerning operational complexity, I think it's just a trade-off between availability and cost. The extra operational cost is as low as running restore-hdfs-segment manually after an HDFS failure, or setting up a daily crontab to run it.
   
   > Kafka indexing service guarantees exactly-once data ingestion, and thus data loss is never expected to happen. If deep storage is not available, all attempts to publish segments would fail and every task should restart from the same offset when publishing failed. 
   
    Yeah, I'm aware of that. But for other reasons we are still using Tranquility as our main ingestion tool, and HDFS failures have caused data loss several times, which has been a big pain for us. We added this feature to solve that problem, and I think it may be useful for other people as well.
