Posted to dev@druid.apache.org by santosh yadav <sa...@gmail.com> on 2019/10/08 15:28:56 UTC

Loading deep storage segments on demand

We have the following use case:
We receive a large amount of data every day. Most queries (99% of the time) run on that day's or that week's data. Queries on data older than a week are rare, but do happen.

One week's data is under a few terabytes, but the total data accumulated over time would reach petabytes. To reduce cost, we want to avoid storing all data in local storage and avoid running a Druid cluster with hundreds of nodes. We would instead like to keep only the relevant data in local storage, move the rest to cheaper deep storage such as S3, and load data from deep storage on demand, only when a time-series query requests it. We think this would let us run and manage a smaller on-prem Druid cluster with much less local storage than deep storage.
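For context, the half of this we believe Druid supports today — keeping only recent data loaded on historicals — is expressed with coordinator retention rules. A minimal sketch (the datasource name and tier are whatever your cluster uses; `_default_tier` and a replica count of 2 here are illustrative assumptions): load segments from the last week, drop everything older.

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1W",
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
```

The catch, and the reason for this email, is that `dropForever` unloads older segments from the cluster entirely: a query for old data returns nothing, rather than triggering an on-demand load from deep storage.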

From our testing and from reading the Druid documentation, it looks like this is not possible today. Please correct me if I am wrong.
Also, would a feature like this fit into the Druid product roadmap? Are there pitfalls or reasons that would make it a bad idea? Was it considered earlier, spec'ed out, but dropped for some reason? If it is merely a matter of development effort, we don't mind doing the work.
Comments from the Druid development community would be highly appreciated. Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@druid.apache.org
For additional commands, e-mail: dev-help@druid.apache.org