Posted to user@beam.apache.org by Matthias Baetens <ma...@datatonic.com> on 2016/11/18 17:42:48 UTC

Apache Beam Java vs Python performance on Google Cloud

Hi Apache Beam users!

Over the last few months I have been playing around a bit with Google
Dataflow/Apache Beam (first in Java and lately in Python as well).

This week I did a quick implementation of the same pipeline in both Java
and Python, involving some processing (String operations and int operations)
and a GroupBy using an Accumulator.

When running the pipelines on Google Cloud, the Java pipeline performed 4-5
times faster than the Python pipeline. Now, this probably makes sense since
Python is in general slower than Java, but I was wondering if there is more
to it and how I could potentially profile the pipelines in a
(semi-)scientific way... Maybe some of you have thoughts/input or have had
similar experiences? Happy to hear your input!
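
One way to make such a comparison a bit more repeatable is to time the same workload several times and report the median rather than a single run. The sketch below is only an illustration under stated assumptions: it uses the Python standard library rather than Beam, and `run_workload` and `benchmark` are hypothetical names standing in for the pipeline body described above (string operations, int operations, and a per-key accumulator), not the actual pipeline.

```python
import statistics
import time
from collections import defaultdict


def run_workload(records):
    """Hypothetical stand-in for the pipeline body: string ops, int ops, grouped sum."""
    totals = defaultdict(int)  # plays the role of the per-key accumulator
    for line in records:
        key, value = line.strip().lower().split(",")  # string operations
        totals[key] += int(value) * 2                 # int operations
    return dict(totals)


def benchmark(func, data, repeats=5, warmup=1):
    """Time func(data) `repeats` times; return (median_seconds, all_timings)."""
    for _ in range(warmup):  # warm-up runs are executed but not recorded
        func(data)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


if __name__ == "__main__":
    # Synthetic input, invented for illustration only.
    sample = ["Key%d,%d\n" % (i % 10, i) for i in range(100_000)]
    median, timings = benchmark(run_workload, sample)
    print("median %.4fs over %d runs (min %.4fs, max %.4fs)"
          % (median, len(timings), min(timings), max(timings)))
```

Reporting the median together with the minimum and maximum gives a rough idea of run-to-run noise, which matters when the difference being measured is a factor of 4-5 on shared cloud workers.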

Best regards,

Matthias

Re: Apache Beam Java vs Python performance on Google Cloud

Posted by Matthias Baetens <ma...@datatonic.com>.
Hi everyone!

Thanks for all the replies, really appreciated.
@Lukasz: Indeed! I was aware of that but forgot to include it in my
initial e-mail, sorry about that. I am sure it will still improve over
time, but I was just wondering about the current state, since we (and some
of our clients as well) are interested in the Python SDK. It would be great
to have a tool to measure (and track) the performance improvements :)
@Jason: That's great news. I just joined the mailing lists, so I probably
missed that message. At the moment I have implemented 3 pipelines: Java,
Python and Python with type hints (since these could possibly increase
performance according to the Google docs, and it would be great to know by
how much). Furthermore, I think it would make sense to run the pipelines both
locally and in the cloud to see if the Runners have any influence on the
performance (and to pinpoint the bottleneck more precisely). Writing an
algorithm that does the same operations in Python and Java locally (without
Dataflow) might also make sense, as this could give insight into the
performance difference independent of Dataflow. It might be interesting to
include different Transforms as well (just transforming, grouping, ...) and
profile them separately. I would be happy to help you with your endeavours!
@Amir: Nice to hear you're looking into it as well, looking forward to seeing
your results!
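
The idea of profiling the transform step and the grouping step separately, outside of Dataflow, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipelines mentioned above: it is plain Python without Beam, and the function names (`transform`, `group_sum`, `time_stages`) and the `key,value` input format are hypothetical.

```python
import time
from collections import defaultdict


def transform(lines):
    """Element-wise step: parse 'key,value' strings into (key, int) pairs."""
    pairs = []
    for line in lines:
        key, value = line.split(",")
        pairs.append((key.strip().lower(), int(value)))
    return pairs


def group_sum(pairs):
    """Grouping step: sum the values per key with a running accumulator."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def time_stages(lines):
    """Run both stages once; return (result, {stage_name: seconds})."""
    timings = {}

    start = time.perf_counter()
    pairs = transform(lines)
    timings["transform"] = time.perf_counter() - start

    start = time.perf_counter()
    result = group_sum(pairs)
    timings["group_sum"] = time.perf_counter() - start

    return result, timings


if __name__ == "__main__":
    # Synthetic input, invented for illustration only.
    data = ["Key%d,%d" % (i % 100, i) for i in range(200_000)]
    _, stage_timings = time_stages(data)
    for stage, seconds in stage_timings.items():
        print("%-10s %.4fs" % (stage, seconds))
```

Running an equivalent program in Java on the same input would give a language-level baseline, so that any additional gap seen on Dataflow can be attributed to the SDK or runner rather than to the languages themselves.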

Best regards,

Matthias

On Fri, Nov 18, 2016 at 8:55 PM, amir bahmanyari <am...@yahoo.com>
wrote:

> Hi Jason,
> How/what are you going to benchmark?
> I have been doing it for sometime.
> Want to make sure I know the objective gaps, if there is any.
> Thanks
> Amir-
>
>
> ------------------------------
> *From:* Jason Kuster <ja...@google.com>
> *To:* user@beam.incubator.apache.org
> *Sent:* Friday, November 18, 2016 11:28 AM
> *Subject:* Re: Apache Beam Java vs Python performance on Google Cloud
>
> Hi Matthias!
>
> Glad to hear you're interested in performance. I've been doing some
> investigation into benchmarking Beam over the last couple of weeks and I'm
> getting fairly close to having something I think will be workable, probably
> next week or the week after. I'm very interested in hearing opinions from
> the community (I solicited feedback from the dev list a few weeks ago but
> neglected to include user@), so I'd love to hear any thoughts you have.
>
> Best,
>
> Jason
>
> On Fri, Nov 18, 2016 at 11:11 AM, Lukasz Cwik <lc...@google.com> wrote:
>
> I would like to point out that the Java code has been around a lot longer
> and has had more time to be optimized while Python has been much more
> recent and is still having lots of changes with much larger improvements in
> performance. That gap between Python and Java has been steadily decreasing
> over the past couple of months.


-- 

*Matthias Baetens*


*datatonic | data power unleashed*
office +44 203 668 3680  |  mobile +44 74 918 20646

Level24 | 1 Canada Square | Canary Wharf | E14 5AB London

Re: Apache Beam Java vs Python performance on Google Cloud

Posted by amir bahmanyari <am...@yahoo.com>.
Hi Jason,
How/what are you going to benchmark?
I have been doing it for some time.
I want to make sure I know the objective gaps, if there are any.
Thanks,
Amir

Re: Apache Beam Java vs Python performance on Google Cloud

Posted by Jason Kuster <ja...@google.com>.
Hi Matthias!

Glad to hear you're interested in performance. I've been doing some
investigation into benchmarking Beam over the last couple of weeks and I'm
getting fairly close to having something I think will be workable, probably
next week or the week after. I'm very interested in hearing opinions from
the community (I solicited feedback from the dev list a few weeks ago but
neglected to include user@), so I'd love to hear any thoughts you have.

Best,

Jason

On Fri, Nov 18, 2016 at 11:11 AM, Lukasz Cwik <lc...@google.com> wrote:

> I would like to point out that the Java code has been around a lot longer
> and has had more time to be optimized while Python has been much more
> recent and is still having lots of changes with much larger improvements in
> performance. That gap between Python and Java has been steadily decreasing
> over the past couple of months.


-- 
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow

Re: Apache Beam Java vs Python performance on Google Cloud

Posted by Lukasz Cwik <lc...@google.com>.
I would like to point out that the Java code has been around a lot longer
and has had more time to be optimized, while the Python code is much more
recent and is still changing a lot, with much larger improvements in
performance. The gap between Python and Java has been steadily decreasing
over the past couple of months.
