Posted to dev@druid.apache.org by Yannik Schröder <sc...@gmail.com> on 2020/11/30 14:52:16 UTC

Flink Connector for Druid

Hello Druid developers,

We are a team of master's students, and we are considering developing a Flink Connector for Druid as part of a university project.

After a couple of days of learning the Druid basics, we are thinking about a suitable architecture and trying to figure out ways to access the data.

For the implementation of the source, we started thinking about projection, selection, and scan queries.
Since Flink potentially runs many source tasks in parallel on multiple nodes, we'd like to access the data directly via the segments, whose metadata can, if our understanding is correct, be obtained through the Broker. However, searching through Druid's API we didn't find any calls that would allow reading segments directly.

We currently see two options: either we push the query down to Druid and intercept it at the point where the data servers are consulted to collect the data (is this possible?), or we execute the query on our own by talking to the different Druid processes to get the metadata about relevant segments, hosts, etc.
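To make the second option more concrete, here is a minimal, hypothetical sketch of how known segments could be partitioned across parallel Flink source subtasks. The segment IDs and the round-robin assignment policy are purely illustrative assumptions on our side, not anything Druid or Flink prescribes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: distribute Druid segment identifiers across Flink
// source subtasks. The round-robin policy is an illustrative assumption;
// a real connector might balance by segment size or data-server locality.
public class SegmentAssigner {

    /**
     * Returns the subset of segments that the subtask with index
     * {@code subtaskIndex} (out of {@code parallelism} subtasks)
     * should read, assigned round-robin by position in the list.
     */
    public static List<String> assign(List<String> segments,
                                      int subtaskIndex, int parallelism) {
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < segments.size(); i++) {
            if (i % parallelism == subtaskIndex) {
                mine.add(segments.get(i));
            }
        }
        return mine;
    }
}
```

Each subtask would then fetch and read only its own share of segments, which is what we mean by executing the query on our own in parallel.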

So our question to you would be: is there any mechanism in Druid that is intended for external systems to access the data directly in one way or another? Or do you have any other alternatives that would be useful in our case? How do other systems that have a Druid source handle this problem? (We saw that Presto offers a Druid connector.)

Another interesting question for us is whether Druid offers some form of change data capture/changelog that is consistently updated with changes to the system. This would be an interesting basis for an unbounded data stream source.

We appreciate the help!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@druid.apache.org
For additional commands, e-mail: dev-help@druid.apache.org


Re: Flink Connector for Druid

Posted by Slim Bouguerra <bs...@apache.org>.
Hi, if my understanding is correct, you are trying to read data in parallel
from the Historicals (keep in mind you might need to read real-time data as
well).
In that case you can do something similar to the Hive Druid integration by
using the Broker API http://%s/druid/v2/datasources/%s/candidates?intervals=%s
which returns all the segment information, including where each segment is
stored.
See this for implementation details:
https://github.com/apache/hive/blob/91ab242841879ca8133c1231ad124b48df6fa05b/druid-handler/src/java/org/apache/hadoop/hive/druid/io/DruidQueryBasedInputFormat.java#L196
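For illustration, the candidates URL above could be built like this. The broker host, datasource, and interval values are just example placeholders, and a real connector would of course also need to fetch and parse the JSON response:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch of constructing the Broker "candidates" URL from this thread.
// The path format mirrors the one used by Hive's DruidQueryBasedInputFormat;
// host, datasource, and interval are illustrative example values.
public class CandidatesUrl {

    /** Builds http://<broker>/druid/v2/datasources/<ds>/candidates?intervals=<iv> */
    public static String build(String broker, String dataSource, String interval) {
        try {
            return String.format(
                "http://%s/druid/v2/datasources/%s/candidates?intervals=%s",
                broker, dataSource,
                // The interval "2020-01-01/2020-02-01" contains a slash,
                // so it must be percent-encoded in the query string.
                URLEncoder.encode(interval, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

The response lists the candidate segments for the interval along with the servers holding them, which is the metadata a parallel reader needs for split assignment.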


On Mon, Nov 30, 2020 at 9:59 AM Yannik Schröder <
schroeder.yannik97@gmail.com> wrote:

> [...]