Posted to commits@beam.apache.org by "Ismaël Mejía (JIRA)" <ji...@apache.org> on 2017/08/30 19:56:00 UTC

[jira] [Commented] (BEAM-2516) User reports 4 minutes to process 1 million line CSV in DirectRunner

    [ https://issues.apache.org/jira/browse/BEAM-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147945#comment-16147945 ] 

Ismaël Mejía commented on BEAM-2516:
------------------------------------

It seems that with all the ongoing work to support the Fn/Runner API we have introduced a regression in the performance of the Direct runner.

On my machine, the classic quickstart WordCount on the kinglear.txt file (170KB) went from 5s with Beam 2.1.0 to 126s with the current 2.2.0-SNAPSHOT.

I executed the same wordcount with different inputs and got these times:
File size   Beam 2.1.0   Beam 2.2.0-SNAPSHOT
0.17MB      5s           126s
1MB         8s           149s
11MB        28s          170s

At a glance, the regression does not appear to depend on the size of the input. I profiled the execution and noticed that GC and threads are OK, but CPU use is now really high because of serialization in the different transforms. I suppose this is the price to pay for all the multi-language proto magic. I know that for a big batch job, for example, this 'set-up' time may be negligible, but this is still a considerable regression.
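
As a quick sanity check on the timings quoted above, the added time per run can be computed directly. This is only arithmetic on the three reported measurements (a sketch, not a new benchmark); the variable names are illustrative.

```python
# Wall-clock timings quoted in this comment:
# (input size in MB, seconds on Beam 2.1.0, seconds on 2.2.0-SNAPSHOT)
timings = [(0.17, 5, 126), (1, 8, 149), (11, 28, 170)]

# Extra seconds introduced by the snapshot for each input size.
added_seconds = [new - old for _, old, new in timings]

# Relative slowdown for each input size.
slowdown = [round(new / old, 1) for _, old, new in timings]

for (size_mb, old, new), extra, ratio in zip(timings, added_seconds, slowdown):
    print(f"{size_mb:>5} MB: +{extra}s ({ratio}x slower)")
```

The added time stays between 121s and 142s while the input grows by roughly 65x, which is consistent with a mostly fixed overhead rather than a per-record cost.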

The performance can be improved by avoiding the translation into the Runner API (in particular for a Java job, which does not really benefit from it), but I imagine this is not the goal, so maybe we need to explore other ways to tackle this, such as caching or reducing some of the serialization weight.
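
To illustrate the caching idea in general terms, here is a minimal sketch of memoizing a serialized payload so that repeated requests for the same transform do not pay the serialization cost again. It is a generic illustration and not how Beam or the DirectRunner is implemented; `SerializationCache`, the string key, and the example payload are all hypothetical, and `pickle` stands in for whatever serialization the runner actually performs.

```python
import pickle


class SerializationCache:
    """Hypothetical helper: serialize each keyed object once and reuse the bytes."""

    def __init__(self):
        self._cache = {}
        self.misses = 0  # number of times serialization actually ran

    def dumps(self, key, obj):
        # Assumes obj is not mutated after its first serialization;
        # otherwise the cached bytes would be stale.
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = pickle.dumps(obj)
        return self._cache[key]


cache = SerializationCache()
payload = {"transform": "ExtractWords", "config": list(range(100))}  # made-up payload

# Request the serialized form many times, as repeated translation might.
results = [cache.dumps("ExtractWords", payload) for _ in range(1000)]

print(cache.misses)
```

The trade-off is the usual one for caches: it only helps if the same object is serialized repeatedly, and it is only safe if that object is effectively immutable.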

> User reports 4 minutes to process 1 million line CSV in DirectRunner
> --------------------------------------------------------------------
>
>                 Key: BEAM-2516
>                 URL: https://issues.apache.org/jira/browse/BEAM-2516
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct
>            Reporter: Kenneth Knowles
>            Priority: Minor
>
> https://stackoverflow.com/questions/44736414/simple-apache-beam-manipulations-work-very-slow
> I don't know what the expectations are here, so I wasn't ready to say this is WAI. Low priority since it isn't what the runner is for anyhow, but this seems like the scale of data that should be snappy. Worth investigating, or maybe you can quickly indicate why it is expected?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)