
Posted to commits@beam.apache.org by "Eugene Kirpichov (JIRA)" <ji...@apache.org> on 2017/08/24 15:38:00 UTC

[jira] [Commented] (BEAM-2803) JdbcIO read is very slow when query return a lot of rows

    [ https://issues.apache.org/jira/browse/BEAM-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140190#comment-16140190 ] 

Eugene Kirpichov commented on BEAM-2803:
----------------------------------------

Could you quantify "very slow": what performance are you expecting, what performance are you getting, and how long does the query itself take (including fully fetching all of the query results)? How did you conclude that it is the GroupByKey that is slow? Also, which runner are you using to run the pipeline?
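
To make the "how long does the query itself take" part concrete, here is a minimal sketch of timing the query outside Beam with plain JDBC, draining every row. The class name QueryFetchTimer, the command-line arguments, and the fetch size of 10,000 are illustrative assumptions, not anything taken from the issue. With the PostgreSQL driver, cursor-based fetching only takes effect when auto-commit is off, which is why the sketch disables it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of Beam or JdbcIO.
final class QueryFetchTimer {

    /** Converts a row count and an elapsed time in nanoseconds to rows per second. */
    static double rowsPerSecond(long rows, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return rows / (elapsedNanos / 1_000_000_000.0);
    }

    /** Runs the query and iterates over every row, returning the number of rows seen. */
    static long fetchAllRows(Connection connection, String sql, int fetchSize)
            throws SQLException {
        // The PostgreSQL driver only uses a cursor (and honors the fetch size)
        // when auto-commit is disabled.
        connection.setAutoCommit(false);
        try (Statement statement = connection.createStatement()) {
            statement.setFetchSize(fetchSize);
            long rows = 0;
            try (ResultSet results = statement.executeQuery(sql)) {
                // Rows are only counted here; no column values are read.
                while (results.next()) {
                    rows++;
                }
            }
            return rows;
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 4) {
            System.err.println("usage: QueryFetchTimer <jdbcUrl> <user> <password> <sql>");
            return;
        }
        try (Connection connection = DriverManager.getConnection(args[0], args[1], args[2])) {
            long start = System.nanoTime();
            long rows = fetchAllRows(connection, args[3], 10_000); // assumed fetch size
            long elapsed = System.nanoTime() - start;
            System.out.printf("%d rows in %.1f s (%.0f rows/s)%n",
                    rows, elapsed / 1e9, rowsPerSecond(rows, elapsed));
        }
    }
}
```

Comparing this number with the pipeline's wall-clock time would show how much of the slowness comes from the database and how much from the rest of the pipeline.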

> JdbcIO read is very slow when query return a lot of rows
> --------------------------------------------------------
>
>                 Key: BEAM-2803
>                 URL: https://issues.apache.org/jira/browse/BEAM-2803
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>    Affects Versions: Not applicable
>            Reporter: Jérémie Vexiau
>            Assignee: Reuven Lax
>              Labels: performance
>             Fix For: Not applicable
>
>
> Hi,
> I'm using the JdbcIO reader in batch mode with the PostgreSQL driver.
> My SELECT query returns more than 5 million rows,
> using cursors with Statement.setFetchSize().
> These ParDos are OK:
> {code:java}
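>           // Run the query and emit one element per row.
>           .apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
>           // Pair each row with a random Integer key for the GroupByKey that follows.
>           .apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {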
>             private Random random;
>             @Setup
>             public void setup() {
>               random = new Random();
>             }
>             @ProcessElement
>             public void processElement(ProcessContext context) {
>               context.output(KV.of(random.nextInt(), context.element()));
>             }
>           }))
> {code}
> but the reshuffle is very, very slow.
> It must be the GroupByKey with more than 5 million keys.
> {code:java}
> .apply(GroupByKey.<Integer, T>create())
> {code}
> Is there a way to optimize the reshuffle, or to use another method to prevent fusion?
> Thanks in advance,
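
The description's point about the number of keys can be checked with a small stand-alone simulation. It is only a sketch: the class KeyCardinalitySketch and its methods are hypothetical and use no Beam APIs. It counts how many distinct keys random.nextInt() produces for a given number of rows. For contrast it also counts the keys from a bounded variant, random.nextInt(buckets). That variant does not appear in the issue, and whether fewer keys would make the GroupByKey faster on a given runner is an open question, not something established in this thread.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical simulation only; it does not run a Beam pipeline.
final class KeyCardinalitySketch {

    /** Distinct keys when every row gets random.nextInt(), as in the DoFn quoted above. */
    static int distinctUnboundedKeys(int rows, long seed) {
        Random random = new Random(seed);
        int[] keys = new int[rows];
        for (int i = 0; i < rows; i++) {
            keys[i] = random.nextInt();
        }
        return countDistinct(keys);
    }

    /** Distinct keys when every row gets random.nextInt(buckets) (assumed variant). */
    static int distinctBoundedKeys(int rows, int buckets, long seed) {
        Random random = new Random(seed);
        int[] keys = new int[rows];
        for (int i = 0; i < rows; i++) {
            keys[i] = random.nextInt(buckets);
        }
        return countDistinct(keys);
    }

    /** Sorts the keys in place and counts the positions where the value changes. */
    private static int countDistinct(int[] keys) {
        if (keys.length == 0) {
            return 0;
        }
        Arrays.sort(keys);
        int distinct = 1;
        for (int i = 1; i < keys.length; i++) {
            if (keys[i] != keys[i - 1]) {
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        int rows = 5_000_000; // row count mentioned in the issue
        System.out.println("unbounded keys: " + distinctUnboundedKeys(rows, 1L));
        System.out.println("bounded keys:   " + distinctBoundedKeys(rows, 1_000, 1L));
    }
}
```

With unbounded random integers almost every row ends up with its own key, so the GroupByKey produces nearly as many groups as there are rows.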



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)