Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/02/05 21:31:54 UTC

[GitHub] sascha-coenen commented on issue #5543: [Proposal] Native parallel batch indexing

sascha-coenen commented on issue #5543: [Proposal] Native parallel batch indexing
URL: https://github.com/apache/incubator-druid/issues/5543#issuecomment-460811816

This motion is AWESOME AWESOME AWESOME!!!!
Well done!! I cannot wait to see the two-phase shuffle. This is SO much needed.

In the comments on related PRs, I read questions about why one would need yet another data processing framework and what the issues with Spark/Hadoop would be.
This puzzles me for the following reason:
I have tasked several people with finding out how to combine batch processing and stream processing for Druid, and although a lot of time was sunk into the subject, not a SINGLE person was able to come up with a viable solution, myself included. Let it be said, too, that we have been running a million-dollar Druid cluster for several years now and keep trillions of records in it.
So we are neither new to Druid nor idiots, and yet we keep scratching our heads about how to put the pieces together.

In my opinion, Druid needs native indexing support more than anything, especially for gaining more widespread adoption and growing the community.

I very much hope that more and more people can join in this effort. Most database systems come with native DML support, and thus competing products, such as MPP databases like Vertica, have native support for ingesting big-data workloads.
Having native batch indexing support in Druid would not only make Druid more competitive and an easier sell; strategically, it is also an enabler for advanced setups, like putting Druid on Kubernetes. Containerizing Hadoop/Spark alone is not easy and far from a small effort, and doing it in a way that lets such a setup play nicely with Druid requires handcrafting the whole setup.
The MiddleManager, however, can easily be containerized (although it would be even nicer if there weren't any peons, I guess), which in turn is a segue to co-locating different workloads on the same hardware. Achieving this for an ecosystem that encompasses Spark/Hadoop is something that only large companies with deep pockets and a budget for in-house customizations can do.

The second most needed feature is OLAP cubing (materialized views), which was recently added to Druid 0.13 as a prototype but currently requires a Hadoop cluster. So folks who went with Spark-based indexing cannot use it unless they reinvent the wheel by adding support for it too.
So in this sense, it is NOT the creation of a native processing framework that is "re-inventing the wheel"; on the contrary, it is precisely the previously chosen approach of relying on external processing frameworks that deserves this label.

---

>> I'm going crazy because the library versions of Hadoop and Druid can't match
+1

---

>> I'm not sure about sharing the same shuffle system by both indexing and querying now because they need different requirements.
+1
Great thinking on the part of jihoonson to propose this, but in the spirit of taking baby steps, it seems one should first keep things easy by thinking about this in isolation. Whether and how existing subsystems of Druid could be unified can then be a separate follow-up research task.
