Posted to dev@druid.apache.org by Rajiv Mordani <rm...@vmware.com.INVALID> on 2019/02/06 20:04:09 UTC

Spark batch with Druid

Is there a best practice for how to load data from Druid to use in a Spark batch job? I asked this question on the user alias but got no response, hence reposting here.


- Rajiv

Re: Spark batch with Druid

Posted by Gian Merlino <gi...@apache.org>.
I'd guess the majority of users are just using Druid itself to process
Druid data, although there are a few people out there who export it into
other systems using techniques like the above.

On Wed, Feb 13, 2019 at 2:00 PM Rajiv Mordani <rm...@vmware.com.invalid>
wrote:

> Am curious to know how people are generally processing data from Druid? We
> want to be able to do Spark processing in a distributed fashion using
> DataFrames.
>
> - Rajiv

Re: Spark batch with Druid

Posted by Rajiv Mordani <rm...@vmware.com.INVALID>.
Am curious to know how people are generally processing data from Druid? We want to be able to do Spark processing in a distributed fashion using DataFrames.

- Rajiv

On 2/11/19, 1:04 PM, "Julian Jaffe" <jj...@pinterest.com.INVALID> wrote:

    Spark can convert an RDD of JSON strings into an RDD/DataFrame/Dataset of
    objects parsed from the JSON (something like
    `sparkSession.read.json(jsonStringRDD)`). You could hook this up to a Druid
    response, but I would definitely recommend looking through the code that
    Gian posted instead - it reads data from deep storage instead of sending an
    HTTP request to the Druid cluster and waiting for the response.
    


Re: Spark batch with Druid

Posted by Julian Jaffe <jj...@pinterest.com.INVALID>.
Spark can convert an RDD of JSON strings into an RDD/DataFrame/Dataset of
objects parsed from the JSON (something like
`sparkSession.read.json(jsonStringRDD)`). You could hook this up to a Druid
response, but I would definitely recommend looking through the code that
Gian posted instead - it reads data from deep storage instead of sending an
HTTP request to the Druid cluster and waiting for the response.
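A minimal sketch of that quick-and-dirty pattern, assuming a native Druid `scan` query POSTed to a Broker (the host, datasource, interval, and query shape below are illustrative placeholders, not tested against a real cluster):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

import org.apache.spark.sql.SparkSession

object DruidResponseToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-response-to-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // POST a native "scan" query to the Broker. Host, datasource, and
    // interval are placeholders; substitute your own.
    val query =
      """{
        |  "queryType": "scan",
        |  "dataSource": "my_datasource",
        |  "intervals": ["2019-01-01/2019-02-01"],
        |  "resultFormat": "list"
        |}""".stripMargin
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://broker-host:8082/druid/v2"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(query))
      .build()
    val responseBody = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
      .body()

    // Parse the JSON response into a DataFrame; Spark infers a schema, so
    // the result mirrors the (nested) scan-response shape.
    val df = spark.read.json(Seq(responseBody).toDS())
    df.printSchema()
  }
}
```

Note that this pulls the whole response through the driver, which is exactly the scaling limitation mentioned above; for anything large, the deep-storage-based approaches elsewhere in this thread will fare better.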

On Sat, Feb 9, 2019 at 5:02 PM Rajiv Mordani <rm...@vmware.com.invalid>
wrote:

> Thanks Julian,
>         See some questions in-line:
>
> On 2/6/19, 3:01 PM, "Julian Jaffe" <jj...@pinterest.com.INVALID> wrote:
>
>     I think this question is going the other way (e.g. how to read data
> into
>     Spark, as opposed to into Druid). For that, the quickest and dirtiest
>     approach is probably to use Spark's json support to parse a Druid
> response.
>
> [Rajiv] Can you please expand more here?

Re: Spark batch with Druid

Posted by Rajiv Mordani <rm...@vmware.com.INVALID>.
Thanks Julian,
	See some questions in-line:

On 2/6/19, 3:01 PM, "Julian Jaffe" <jj...@pinterest.com.INVALID> wrote:

    I think this question is going the other way (e.g. how to read data into
    Spark, as opposed to into Druid). For that, the quickest and dirtiest
    approach is probably to use Spark's json support to parse a Druid response.

[Rajiv] Can you please expand more here?

    You may also be able to repurpose some code from
    https://github.com/SparklineData/spark-druid-olap, but I don't think
    there's any official guidance on this.


    


Re: Spark batch with Druid

Posted by Gian Merlino <gi...@apache.org>.
Ah, you're right. I misread the original question.

In that case, also try checking out:
https://github.com/implydata/druid-hadoop-inputformat, an unofficial Druid
InputFormat. Spark can use that to read Druid data into an RDD - check the
example in the README. It's also unofficial and, currently, unmaintained,
so you'd be taking on some maintenance effort if you want to use it.
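For reference, the general Spark pattern for reading through any custom Hadoop InputFormat looks like the sketch below. Only `newAPIHadoopRDD` is standard Spark API; the Druid class name and configuration keys here are placeholders standing in for whatever druid-hadoop-inputformat actually exposes (take the real names from its README):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.spark.sql.SparkSession

object DruidInputFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("druid-inputformat")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical configuration: the actual keys (datasource, intervals,
    // cluster locations, etc.) are defined by the InputFormat itself.
    val conf = new Configuration()
    conf.set("druid.datasource", "my_datasource")
    conf.set("druid.intervals", "2019-01-01/2019-02-01")

    // DruidInputFormat and DruidRow are placeholder names for the project's
    // real key/value and InputFormat classes; see its README.
    val rdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[DruidInputFormat],
      classOf[NullWritable],
      classOf[DruidRow])
    println(s"rows: ${rdd.count()}")
  }
}
```

The advantage of this route over querying the Broker is that each Spark task reads segments from deep storage in parallel, so nothing funnels through a single HTTP response.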

On Wed, Feb 6, 2019 at 3:01 PM Julian Jaffe <jj...@pinterest.com.invalid>
wrote:

> I think this question is going the other way (e.g. how to read data into
> Spark, as opposed to into Druid). For that, the quickest and dirtiest
> approach is probably to use Spark's json support to parse a Druid response.

Re: Spark batch with Druid

Posted by Julian Jaffe <jj...@pinterest.com.INVALID>.
I think this question is going the other way (e.g. how to read data into
Spark, as opposed to into Druid). For that, the quickest and dirtiest
approach is probably to use Spark's json support to parse a Druid response.
You may also be able to repurpose some code from
https://github.com/SparklineData/spark-druid-olap, but I don't think
there's any official guidance on this.

On Wed, Feb 6, 2019 at 2:21 PM Gian Merlino <gi...@apache.org> wrote:

> Hey Rajiv,
>
> There's an unofficial Druid/Spark adapter at:
> https://github.com/metamx/druid-spark-batch. If you want to stick to
> official things, then the best approach would be to use Spark to write data
> to HDFS or S3 and then ingest it into Druid using Druid's Hadoop-based or
> native batch ingestion. (Or even write it to Kafka using Spark Streaming
> and ingest from Kafka into Druid using Druid's Kafka indexing service.)

Re: Spark batch with Druid

Posted by Gian Merlino <gi...@apache.org>.
Hey Rajiv,

There's an unofficial Druid/Spark adapter at:
https://github.com/metamx/druid-spark-batch. If you want to stick to
official things, then the best approach would be to use Spark to write data
to HDFS or S3 and then ingest it into Druid using Druid's Hadoop-based or
native batch ingestion. (Or even write it to Kafka using Spark Streaming
and ingest from Kafka into Druid using Druid's Kafka indexing service.)
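To make the official path concrete: after Spark writes the data out (e.g. `df.write.json("/shared/druid-input/")`), a native batch `index` task can ingest the files. A trimmed task-spec sketch follows; the datasource, timestamp column, dimensions, and paths are all placeholders, and the exact fields depend on your Druid version:

```json
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "my_datasource",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "ts", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2019-01-01/2019-02-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/shared/druid-input",
        "filter": "*.json"
      }
    }
  }
}
```

POST the spec to the Overlord at `/druid/indexer/v1/task`. For data written to S3, swap the `local` firehose for `static-s3`; for the Hadoop-based route the task type is `index_hadoop` with a different ioConfig.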

On Wed, Feb 6, 2019 at 12:04 PM Rajiv Mordani <rm...@vmware.com.invalid>
wrote:

> Is there a best practice for how to load data from druid to use in a spark
> batch job? I asked this question on the user alias but got no response
> hence reposting here.