Posted to user@spark.apache.org by SRK <sw...@gmail.com> on 2016/06/30 18:19:12 UTC

How to spin up Kafka using docker and use for Spark Streaming Integration tests

Hi,

I need to do integration tests using Spark Streaming. My idea is to spin up
kafka using docker locally and use it to feed the stream to my Streaming
Job. Any suggestions on how to do this would be of great help.

Thanks,
Swetha



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

Posted by swetha kasireddy <sw...@gmail.com>.
The application's output is data inserted into Cassandra at the end of
every batch.

On Mon, Jul 4, 2016 at 5:20 AM, Lars Albertsson <la...@mapflat.com> wrote:

> I created such a setup for a client a few months ago. It is pretty
> straightforward, but it can take some work to get all the wires
> connected.
>
> I suggest that you start with the spotify/kafka
> (https://github.com/spotify/docker-kafka) Docker image, since it
> includes a bundled zookeeper. The alternative would be to spin up a
> separate Zookeeper Docker container and connect them, but for testing
> purposes, it would make the setup more complex.
>
> You'll need to inform Kafka about the external address it exposes by
> setting ADVERTISED_HOST to the output of "docker-machine ip" (on Mac)
> or the address printed by "ip addr show docker0" (Linux). I also
> suggest setting
> AUTO_CREATE_TOPICS to true.
>
> You can choose to run your Spark Streaming application under test
> (SUT) and your test harness also in Docker containers, or directly on
> your host.
>
> In the former case, it is easiest to set up a Docker Compose file
> linking the harness and SUT to Kafka. This variant provides better
> isolation, and might integrate better if you have existing similar
> test frameworks.
>
> If you want to run the harness and SUT outside Docker, I suggest that
> you build your harness with a standard test framework, e.g. scalatest
> or JUnit, and run both harness and SUT in the same JVM. In this case,
> you put code to bring up the Kafka Docker container in test framework
> setup methods. This test strategy integrates better with IDEs and
> build tools (mvn/sbt/gradle), since they will run (and debug) your
> tests without any special integration. I therefore prefer this
> strategy.
>
>
> What is the output of your application? If it is messages on a
> different Kafka topic, the test harness can merely subscribe and
> verify output. If you emit output to a database, you'll need another
> Docker container, integrated with Docker Compose. If you are emitting
> database entries, your test oracle will need to frequently poll the
> database for the expected records, with a timeout in order not to hang
> on failing tests.
>
> I hope this is comprehensible. Let me know if you have followup questions.
>
> Regards,
>
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS
>
>
>
> On Thu, Jun 30, 2016 at 8:19 PM, SRK <sw...@gmail.com> wrote:
> > Hi,
> >
> > I need to do integration tests using Spark Streaming. My idea is to spin
> up
> > kafka using docker locally and use it to feed the stream to my Streaming
> > Job. Any suggestions on how to do this would be of great help.
> >
> > Thanks,
> > Swetha
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

Posted by Lars Albertsson <la...@mapflat.com>.
Let us assume that you want to build an integration test setup where
you run all participating components in Docker.

You create a docker-compose.yml with four services, something like this:

# Start docker-compose.yml
version: '2'

services:
  myapp:
    build: myapp_dir
    links:
      - kafka
      - cassandra

  kafka:
    image: spotify/kafka
    environment:
      - ADVERTISED_HOST
    ports:
      - "2181:2181"
      - "9092:9092"

  cassandra:
    image: spotify/cassandra
    environment:
      - <might need some tweaking here>
    ports:
      - "9042:9042"

  test_harness:
    build: test_harness_dir
    links:
      - kafka
      - cassandra
# End docker-compose.yml

I haven't used the spotify/cassandra image, so you might need to do
some environment variable plumbing to get it working.

Your test harness would then push messages to Kafka, and poll
Cassandra for the expected output. Your Spark Streaming application's
Docker image has Spark installed, and runs Spark with a local master.
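
As a rough illustration, the harness side could look something like the
Scala sketch below. The topic, keyspace, and table names are made up for
the example, and it assumes the kafka-clients and DataStax Java driver
libraries are on the test classpath:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import com.datastax.driver.core.Cluster

// Push a test message to the Kafka broker exposed by the container.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("input_topic", "key1", """{"id": 1}"""))
producer.flush()

// Poll Cassandra until the streaming job has written the expected row,
// with a deadline so a broken job fails the test instead of hanging it.
val cluster = Cluster.builder().addContactPoint("localhost").withPort(9042).build()
val session = cluster.connect("test_keyspace")
val deadline = System.currentTimeMillis() + 60000
var found = false
while (!found && System.currentTimeMillis() < deadline) {
  found = session.execute("SELECT * FROM results WHERE id = 1").one() != null
  if (!found) Thread.sleep(500)
}
assert(found, "Expected row did not appear in Cassandra within the timeout")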


You need to run this on a machine that has Docker and Docker Compose
installed, typically an Ubuntu host. This machine can be either bare
metal or a full VM (VirtualBox, VMware, Xen), which is what you get if
you run in an IaaS cloud like GCE or EC2. Hence, your CI/CD Jenkins
machine should be a dedicated instance.

Developers with Macs would run docker-machine, which uses Virtualbox
IIRC. Developers with Linux machines can run Docker and Docker Compose
natively.

You can in theory run Jenkins in Docker and spin up new Docker
containers from inside Docker using some docker-inside-docker setup.
It will add complexity, however, and I suspect it will be brittle, so
I don't recommend it.

You could also, in theory, use a cloud container service that runs
your images during tests. Such services wire Docker images together
differently than Docker Compose does, however, so this also increases
complexity and makes the CI/CD setup differ from the setup on local
developer machines. I went down this path once, but I cannot recommend it.


If you instead want a setup where the test harness and your Spark
Streaming application run outside Docker, you omit them from
docker-compose.yml and have the test harness run docker-compose itself,
figuring out the ports and addresses to connect to. As mentioned
earlier, this requires more plumbing, but results in an integration
test setup that runs smoothly from Gradle/Maven/SBT and also from
IntelliJ.
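
A minimal sketch of that variant, assuming docker-compose is on the PATH
and the harness shells out via scala.sys.process:

import scala.sys.process._

// Bring up Kafka and Cassandra before the tests run; tear them down after.
require("docker-compose up -d".! == 0, "docker-compose up failed")

// With the fixed port mappings from the compose file above, the harness can
// simply connect to localhost:9092 (Kafka) and localhost:9042 (Cassandra).
// Alternatively, ask Docker Compose which host port a service was mapped to:
val kafkaPort = "docker-compose port kafka 9092".!!.trim.split(":").last.toInt

// ... run the actual tests here ...

"docker-compose down".!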

I hope things are clearer. Let me know if you have further questions.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109
Calendar: https://goo.gl/6FBtlS



On Thu, Jul 7, 2016 at 3:14 AM, swetha kasireddy
<sw...@gmail.com> wrote:
> Can this docker image be used to spin up kafka cluster in a CI/CD pipeline
> like Jenkins to run the integration tests? Or it can be done only in the
> local machine that has docker installed? I assume that the box where the
> CI/CD pipeline runs should have docker installed correct?
>
> On Mon, Jul 4, 2016 at 5:20 AM, Lars Albertsson <la...@mapflat.com> wrote:
>>
>> I created such a setup for a client a few months ago. It is pretty
>> straightforward, but it can take some work to get all the wires
>> connected.
>>
>> I suggest that you start with the spotify/kafka
>> (https://github.com/spotify/docker-kafka) Docker image, since it
>> includes a bundled zookeeper. The alternative would be to spin up a
>> separate Zookeeper Docker container and connect them, but for testing
>> purposes, it would make the setup more complex.
>>
>> You'll need to inform Kafka about the external address it exposes by
>> setting ADVERTISED_HOST to the output of "docker-machine ip" (on Mac)
>> or the address printed by "ip addr show docker0" (Linux). I also
>> suggest setting
>> AUTO_CREATE_TOPICS to true.
>>
>> You can choose to run your Spark Streaming application under test
>> (SUT) and your test harness also in Docker containers, or directly on
>> your host.
>>
>> In the former case, it is easiest to set up a Docker Compose file
>> linking the harness and SUT to Kafka. This variant provides better
>> isolation, and might integrate better if you have existing similar
>> test frameworks.
>>
>> If you want to run the harness and SUT outside Docker, I suggest that
>> you build your harness with a standard test framework, e.g. scalatest
>> or JUnit, and run both harness and SUT in the same JVM. In this case,
>> you put code to bring up the Kafka Docker container in test framework
>> setup methods. This test strategy integrates better with IDEs and
>> build tools (mvn/sbt/gradle), since they will run (and debug) your
>> tests without any special integration. I therefore prefer this
>> strategy.
>>
>>
>> What is the output of your application? If it is messages on a
>> different Kafka topic, the test harness can merely subscribe and
>> verify output. If you emit output to a database, you'll need another
>> Docker container, integrated with Docker Compose. If you are emitting
>> database entries, your test oracle will need to frequently poll the
>> database for the expected records, with a timeout in order not to hang
>> on failing tests.
>>
>> I hope this is comprehensible. Let me know if you have followup questions.
>>
>> Regards,
>>
>>
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>> Calendar: https://goo.gl/6FBtlS
>>
>>
>>
>> On Thu, Jun 30, 2016 at 8:19 PM, SRK <sw...@gmail.com> wrote:
>> > Hi,
>> >
>> > I need to do integration tests using Spark Streaming. My idea is to spin
>> > up
>> > kafka using docker locally and use it to feed the stream to my Streaming
>> > Job. Any suggestions on how to do this would be of great help.
>> >
>> > Thanks,
>> > Swetha
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> >
>
>



Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

Posted by swetha kasireddy <sw...@gmail.com>.
Can this Docker image be used to spin up a Kafka cluster in a CI/CD pipeline
like Jenkins to run the integration tests, or can it be done only on a
local machine that has Docker installed? I assume that the box where the
CI/CD pipeline runs should have Docker installed, correct?

On Mon, Jul 4, 2016 at 5:20 AM, Lars Albertsson <la...@mapflat.com> wrote:

> I created such a setup for a client a few months ago. It is pretty
> straightforward, but it can take some work to get all the wires
> connected.
>
> I suggest that you start with the spotify/kafka
> (https://github.com/spotify/docker-kafka) Docker image, since it
> includes a bundled zookeeper. The alternative would be to spin up a
> separate Zookeeper Docker container and connect them, but for testing
> purposes, it would make the setup more complex.
>
> You'll need to inform Kafka about the external address it exposes by
> setting ADVERTISED_HOST to the output of "docker-machine ip" (on Mac)
> or the address printed by "ip addr show docker0" (Linux). I also
> suggest setting
> AUTO_CREATE_TOPICS to true.
>
> You can choose to run your Spark Streaming application under test
> (SUT) and your test harness also in Docker containers, or directly on
> your host.
>
> In the former case, it is easiest to set up a Docker Compose file
> linking the harness and SUT to Kafka. This variant provides better
> isolation, and might integrate better if you have existing similar
> test frameworks.
>
> If you want to run the harness and SUT outside Docker, I suggest that
> you build your harness with a standard test framework, e.g. scalatest
> or JUnit, and run both harness and SUT in the same JVM. In this case,
> you put code to bring up the Kafka Docker container in test framework
> setup methods. This test strategy integrates better with IDEs and
> build tools (mvn/sbt/gradle), since they will run (and debug) your
> tests without any special integration. I therefore prefer this
> strategy.
>
>
> What is the output of your application? If it is messages on a
> different Kafka topic, the test harness can merely subscribe and
> verify output. If you emit output to a database, you'll need another
> Docker container, integrated with Docker Compose. If you are emitting
> database entries, your test oracle will need to frequently poll the
> database for the expected records, with a timeout in order not to hang
> on failing tests.
>
> I hope this is comprehensible. Let me know if you have followup questions.
>
> Regards,
>
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS
>
>
>
> On Thu, Jun 30, 2016 at 8:19 PM, SRK <sw...@gmail.com> wrote:
> > Hi,
> >
> > I need to do integration tests using Spark Streaming. My idea is to spin
> up
> > kafka using docker locally and use it to feed the stream to my Streaming
> > Job. Any suggestions on how to do this would be of great help.
> >
> > Thanks,
> > Swetha
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

Posted by Lars Albertsson <la...@mapflat.com>.
I created such a setup for a client a few months ago. It is pretty
straightforward, but it can take some work to get all the wires
connected.

I suggest that you start with the spotify/kafka
(https://github.com/spotify/docker-kafka) Docker image, since it
includes a bundled zookeeper. The alternative would be to spin up a
separate Zookeeper Docker container and connect them, but for testing
purposes, it would make the setup more complex.

You'll need to inform Kafka about the external address it exposes by
setting ADVERTISED_HOST to the output of "docker-machine ip" (on Mac)
or the address printed by "ip addr show docker0" (Linux). I also
suggest setting AUTO_CREATE_TOPICS to true.

You can choose to run your Spark Streaming application under test
(SUT) and your test harness also in Docker containers, or directly on
your host.

In the former case, it is easiest to set up a Docker Compose file
linking the harness and SUT to Kafka. This variant provides better
isolation, and might integrate better if you have existing similar
test frameworks.

If you want to run the harness and SUT outside Docker, I suggest that
you build your harness with a standard test framework, e.g. scalatest
or JUnit, and run both harness and SUT in the same JVM. In this case,
you put code to bring up the Kafka Docker container in test framework
setup methods. This test strategy integrates better with IDEs and
build tools (mvn/sbt/gradle), since they will run (and debug) your
tests without any special integration. I therefore prefer this
strategy.
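
For instance, with scalatest the setup could look roughly like the sketch
below. The container name, topic, and the use of docker-machine on a Mac
are assumptions, and error handling is kept minimal:

import org.scalatest.{BeforeAndAfterAll, FunSuite}
import scala.sys.process._

class StreamingIntegrationSpec extends FunSuite with BeforeAndAfterAll {

  // Address the broker should advertise: "docker-machine ip" on a Mac;
  // on Linux, use the docker0 interface address instead.
  private val advertisedHost = "docker-machine ip".!!.trim

  override def beforeAll(): Unit = {
    // Start the spotify/kafka container before any test runs.
    val cmd = Seq("docker", "run", "-d", "--name", "test-kafka",
      "-p", "2181:2181", "-p", "9092:9092",
      "-e", s"ADVERTISED_HOST=$advertisedHost",
      "-e", "AUTO_CREATE_TOPICS=true",
      "spotify/kafka")
    require(cmd.! == 0, "Failed to start the Kafka container")
  }

  override def afterAll(): Unit = {
    // Remove the container even if tests failed.
    Seq("docker", "rm", "-f", "test-kafka").!
  }

  test("streaming job processes one message end to end") {
    // produce to Kafka, run the job under test, assert on its output ...
  }
}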


What is the output of your application? If it is messages on a
different Kafka topic, the test harness can merely subscribe and
verify output. If you emit output to a database, you'll need another
Docker container, integrated with Docker Compose. If you are emitting
database entries, your test oracle will need to frequently poll the
database for the expected records, with a timeout in order not to hang
on failing tests.
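
A minimal sketch of such a polling oracle; findExpectedRecord is a
hypothetical query helper supplied by the test:

import scala.annotation.tailrec

// Polls `probe` until it yields a result or the timeout expires.
@tailrec
def pollUntil[T](timeoutMs: Long, intervalMs: Long = 500)(probe: () => Option[T]): T =
  probe() match {
    case Some(result) => result
    case None if timeoutMs <= 0 =>
      throw new AssertionError("Expected records did not appear before the timeout")
    case None =>
      Thread.sleep(intervalMs)
      pollUntil(timeoutMs - intervalMs, intervalMs)(probe)
  }

// Usage: fail the test if the record has not shown up within 60 seconds.
// val record = pollUntil(timeoutMs = 60000)(() => findExpectedRecord())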

I hope this is comprehensible. Let me know if you have followup questions.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109
Calendar: https://goo.gl/6FBtlS



On Thu, Jun 30, 2016 at 8:19 PM, SRK <sw...@gmail.com> wrote:
> Hi,
>
> I need to do integration tests using Spark Streaming. My idea is to spin up
> kafka using docker locally and use it to feed the stream to my Streaming
> Job. Any suggestions on how to do this would be of great help.
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>



Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

Posted by Akhil Das <ak...@hacked.work>.
You can use https://github.com/wurstmeister/kafka-docker to spin up a
Kafka cluster and then point your Spark Streaming job at it to consume from it.
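
For example, with the spark-streaming-kafka artifact of that era (the 0.8
direct stream API), the consuming side of a local test could look roughly
like this; the topic name and broker address are assumptions that depend
on how the container's ports are mapped:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Local-mode streaming context for the integration test.
val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-docker-test")
val ssc = new StreamingContext(conf, Seconds(1))

// Point the direct stream at the broker exposed by the Docker container.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("test_topic"))

// Do whatever the job under test does; here we just print the values.
stream.map(_._2).print()

ssc.start()
ssc.awaitTerminationOrTimeout(30000)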

On Fri, Jul 1, 2016 at 1:19 AM, SRK <sw...@gmail.com> wrote:

> Hi,
>
> I need to do integration tests using Spark Streaming. My idea is to spin up
> kafka using docker locally and use it to feed the stream to my Streaming
> Job. Any suggestions on how to do this would be of great help.
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>


-- 
Cheers!