Posted to user@beam.apache.org by Krystian Kichewko <kr...@gmail.com> on 2020/04/11 18:44:43 UTC

Word Count example execution duration difference

Hello!

I'm trying to learn Apache Beam, and I was looking into the examples when I
noticed something unusual:

The "word count" example seems to run much faster in Python than in Java.

Python example pipeline on King Lear:

real 0m9.294s
user 0m2.822s
sys 0m0.370s

Java example pipeline on King Lear:

real 1m35.780s
user 4m10.089s
sys 0m1.743s

As you can see, that is about 10 sec vs 96 sec of real time, and the Java run
also uses far more CPU time because it uses all of the CPU cores.

Is this some kind of limitation of Java's direct runner? Or am I doing
something wrong? Is this intended? Should I file a bug?

Or does this difference disappear in real-life pipelines?

I got similar results when testing using Google Colab:
https://beam.apache.org/get-started/try-apache-beam/

When you run the pipelines on all of Shakespeare's works in the bucket
(gs://dataflow-samples/shakespeare/*), the difference is even greater:

Python:

real 0m47.900s
user 0m18.350s
sys 0m0.579s

Java:

real 14m28.201s
user 28m3.206s
sys 0m7.597s
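
For anyone who wants to check the arithmetic, here is a minimal sketch that converts the `time` figures quoted above into seconds and compares them. The helper names `to_seconds` and `ratio` are made up for this illustration and are not part of Beam or of the original commands; the sketch assumes durations in the `XmY.YYYs` form that bash's `time` prints.

```shell
# Convert a `time`-style duration such as "1m35.780s" to seconds.
# (Hypothetical helper, written only for this comparison.)
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times larger the first duration is than the second.
ratio() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", a / b }'
}

to_seconds 1m35.780s          # Java real time on King Lear, in seconds
ratio 1m35.780s 0m9.294s      # Java vs Python real time, King Lear
ratio 14m28.201s 0m47.900s    # Java vs Python real time, all works
ratio 4m10.089s 1m35.780s     # Java user/real; above 1 means several cores were busy
```

With the numbers above this gives roughly a 10x gap on King Lear and an 18x gap on the full set, and a Java user/real ratio of about 2.6.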


How to reproduce:

Python 3.7:

docker run -it --rm python:3.7-buster /bin/bash
pip3 install apache-beam[gcp]
mkdir -p /tmp/foo
cd /tmp/foo
time python -m apache_beam.examples.wordcount --input \
gs://dataflow-samples/shakespeare/kinglear.txt --output ./count

real 0m9.294s
user 0m2.822s
sys 0m0.370s


Java:

docker run -it --rm ubuntu:16.04 /bin/bash
apt update
apt install default-jdk maven
mkdir -p /tmp/foo
cd /tmp/foo
mvn archetype:generate -DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.19.0 -DgroupId=org.example \
-DartifactId=word-count-beam -Dversion="0.1" \
-Dpackage=org.apache.beam.examples -DinteractiveMode=false
cd word-count-beam
mvn compile
time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
--output=counts" -Pdirect-runner

Run it twice, because the first run also downloads the Maven dependencies:

time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
--output=counts" -Pdirect-runner

real 1m35.780s
user 4m10.089s
sys 0m1.743s


Thanks,
Krystian Kichewko

Re: Word Count example execution duration difference

Posted by Luke Cwik <lc...@google.com>.
The Java DirectRunner enforces additional strictness checks that are
expensive (such as re-encoding all elements to make sure that the coder is
compatible).

Retry your run with --enforceImmutability=false --enforceEncodability=false
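
As a sketch of how that might look with the Maven command from the original message: this assumes the two flags are passed like any other pipeline option, inside -Dexec.args. The variable EXEC_ARGS and the function name run_wordcount_relaxed are made up for this illustration.

```shell
# Pipeline arguments from the original Java command, plus the two
# DirectRunner flags suggested above. Continuation lines are not indented
# because they sit inside the double quotes.
EXEC_ARGS="--inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
--output=counts \
--enforceImmutability=false \
--enforceEncodability=false"

# Hypothetical wrapper: run it from the word-count-beam directory
# that the archetype generated.
run_wordcount_relaxed() {
  time mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="$EXEC_ARGS" -Pdirect-runner
}
```

Calling run_wordcount_relaxed after `mvn compile` should give a timing that can be compared directly with the earlier Java run. The checks are presumably worth leaving on while developing a pipeline, since they exist to catch coder and mutation bugs.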
