Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/07/26 21:39:37 UTC

[GitHub] gianm commented on issue #6036: use S3 as a backup storage for hdfs deep storage

gianm commented on issue #6036: use S3 as a backup storage for hdfs deep storage
URL: https://github.com/apache/incubator-druid/pull/6036#issuecomment-408243515
 
 
   My feeling is that Druid already deals with potential HDFS reliability issues by retrying, and that should be enough if your HDFS cluster is well run. Pushing to an alternate deep storage, if the original one is not available, seems like a hack to work around a poorly run HDFS cluster.
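   (Concretely, by "retrying" I mean a loop roughly like the following around the push call; this is an illustrative sketch with made-up names, not Druid's actual retry code.)
   
   ```java
   // Illustrative retry-with-backoff around a segment push; the class,
   // method, and parameter names here are hypothetical.
   import java.io.IOException;
   import java.util.concurrent.Callable;
   
   public final class PushRetry
   {
     public static <T> T retry(Callable<T> push, int maxTries, long baseSleepMillis) throws Exception
     {
       for (int attempt = 1; ; attempt++) {
         try {
           return push.call();
         }
         catch (IOException e) {
           if (attempt >= maxTries) {
             throw e; // give up after maxTries attempts
           }
           // exponential backoff before the next attempt
           Thread.sleep(baseSleepMillis * (1L << (attempt - 1)));
         }
       }
     }
   }
   ```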
   
   > At first I thought about implementing something like composite-deep-storage, which could add backup abilities to all deep storages, but found it's non-trivial to load multiple deep storage extensions inside composite-deep-storage. So I decided to add support for hdfs-deep-storage only, just because we're using it.
   
   IMO this is the best approach, since it's composable, rather than tightly coupling the HDFS and S3 implementations. It could also be done as a contrib extension that doesn't touch the HDFS or S3 implementations, which I think is preferable for more niche functionality like this. It means less surface area to test in the main codebase, which is a good thing. @gaodayue, what sort of difficulty did you have when you tried to implement this? Maybe it is something we can solve?
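   To make the composability point concrete, here is a minimal sketch of what a contrib composite pusher could look like. The `SegmentPusher` interface below is a simplified stand-in, not Druid's actual `DataSegmentPusher`, and the failover-on-write policy is just one possible choice:
   
   ```java
   // Simplified stand-in for a deep storage pusher; Druid's real
   // DataSegmentPusher interface and extension wiring are more involved.
   import java.io.File;
   import java.io.IOException;
   import java.util.List;
   
   interface SegmentPusher
   {
     void push(File segmentDir, String segmentId) throws IOException;
   }
   
   // Tries each delegate in order (e.g. HDFS first, then S3) and succeeds
   // as soon as one of them accepts the segment.
   class CompositeSegmentPusher implements SegmentPusher
   {
     private final List<SegmentPusher> delegates;
   
     CompositeSegmentPusher(List<SegmentPusher> delegates)
     {
       this.delegates = delegates;
     }
   
     @Override
     public void push(File segmentDir, String segmentId) throws IOException
     {
       IOException lastError = null;
       for (SegmentPusher delegate : delegates) {
         try {
           delegate.push(segmentDir, segmentId);
           return; // first successful push wins
         }
         catch (IOException e) {
           lastError = e; // remember the failure, fall through to the next store
         }
       }
       throw new IOException("all deep storages failed for segment " + segmentId, lastError);
     }
   }
   ```
   
   Of course, this glosses over the part @gaodayue found non-trivial: actually loading and wiring up multiple deep storage extensions inside one composite extension.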
   
   With regard to @jihoonson's comment:
   
   > I think this concept is more appropriate for backup deep storage because, if the deep storage is not available for writing, it very likely has other problems which can introduce unexpected behaviors to Druid.
   
   Yeah, the 'bad' HDFS cluster would likely have other problems, like maybe you can't read from it. So this backup deep storage technique would improve availability for writes (because you can write to either deep storage) but would _worsen_ availability for reads (because you must read from the one that has your segment, so you will have reading downtime if any one of your deep storages is down). I can see that this makes sense for use cases where you want to maximize the ability of realtime tasks to write new data, and are OK with potentially worse ability for historicals to download segments.
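   To put illustrative numbers on that trade-off (assuming independent failures, which is generous): if each deep storage is up 99% of the time, a write that can go to either store only fails when both are down, so it succeeds 1 - 0.01 * 0.01 = 99.99% of the time. But reading a given segment still depends on the single store that holds it, so that stays at 99%, and a historical that needs segments spread across both stores is blocked whenever either one is down, i.e. roughly 0.99 * 0.99 ≈ 98% of the time.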
   
   I think all signs point to this functionality making more sense as a contrib extension, so I hope we can figure out how to do that.
