Posted to commits@beam.apache.org by "Eugene Kirpichov (JIRA)" <ji...@apache.org> on 2017/08/26 17:44:00 UTC
[jira] [Comment Edited] (BEAM-2803) JdbcIO read is very slow when query returns a lot of rows
[ https://issues.apache.org/jira/browse/BEAM-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142875#comment-16142875 ]
Eugene Kirpichov edited comment on BEAM-2803 at 8/26/17 5:43 PM:
-----------------------------------------------------------------
Hmm, indeed, it seems that the shuffle is quite slow here: a single worker ends up writing about 32 GB of data from the query results, and shuffling that on one worker is slow. Maybe try this? https://cloud.google.com/blog/big-data/2017/06/introducing-cloud-dataflow-shuffle-for-up-to-5x-performance-improvement-in-data-analytic-pipelines
was (Author: jkff):
Hmm, indeed, seems that shuffle is being quite slow here. Maybe try this? https://cloud.google.com/blog/big-data/2017/06/introducing-cloud-dataflow-shuffle-for-up-to-5x-performance-improvement-in-data-analytic-pipelines
> JdbcIO read is very slow when query returns a lot of rows
> --------------------------------------------------------
>
> Key: BEAM-2803
> URL: https://issues.apache.org/jira/browse/BEAM-2803
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-extensions
> Affects Versions: Not applicable
> Reporter: Jérémie Vexiau
> Assignee: Reuven Lax
> Labels: performance
> Fix For: Not applicable
>
> Attachments: test1500K.png, test1M.png, test2M.jpg, test500k.png
>
>
> Hi,
> I'm using the JdbcIO reader in batch mode with the PostgreSQL driver.
> My select query returns more than 5 million rows, and I'm using cursors via Statement.setFetchSize().
> These ParDos are OK:
> {code:java}
> .apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
> .apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
>     private Random random;
>
>     @Setup
>     public void setup() {
>         random = new Random();
>     }
>
>     @ProcessElement
>     public void processElement(ProcessContext context) {
>         context.output(KV.of(random.nextInt(), context.element()));
>     }
> }))
> {code}
> But the reshuffle is very, very slow; it must be the GroupByKey with more than 5 million keys.
> {code:java}
> .apply(GroupByKey.<Integer, T>create())
> {code}
> Is there a way to optimize the reshuffle, or another method to prevent fusion?
> Thanks in advance,
> Edit:
> I added some tests.
> I'm using Google Dataflow as the runner, with 1 worker (2 max), workerMachineType n1-standard-2,
> and autoscalingAlgorithm THROUGHPUT_BASED.
> 1st test: the query returns 500 000 results:
> !test500k.png!
> ParDo(Read) => about 1300 r/s
> GroupByKey => about 1080 r/s
> 2nd test: the query returns 1 000 000 results:
> !test1M.png!
> ParDo(Read) => 1480 r/s
> GroupByKey => 634 r/s
> 3rd test: the query returns 1 500 000 results:
> !test1500K.png!
> ParDo(Read) => 1700 r/s
> GroupByKey => 565 r/s
> 4th test: the query returns 2 000 000 results:
> !test2M.jpg!
> ParDo(Read) => 1485 r/s
> GroupByKey => 537 r/s
> As we can see, the GroupByKey rate decreases as the number of records grows.
> PS: the 2nd worker starts only after ParDo(Read) has succeeded.
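The random-key + GroupByKey pattern in the quoted pipeline can be illustrated with a minimal plain-Java sketch (no Beam dependency; the class and method names here are hypothetical, not Beam APIs). It shows why the GroupByKey gets slower as the row count grows: with Random.nextInt() over the full int range, almost every element receives a unique key, so the grouping must materialize roughly one group per element. Bounding the key space caps the number of groups instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class ReshuffleSketch {
    // Assign each element a random key and group by it, mimicking the
    // DoFn<T, KV<Integer, T>> + GroupByKey combination from the report.
    // bound <= 0 means "full int range" (as in random.nextInt() above);
    // bound > 0 limits the key space, and hence the number of groups.
    static <T> Map<Integer, List<T>> groupByRandomKey(List<T> elements, Random random, int bound) {
        Map<Integer, List<T>> groups = new HashMap<>();
        for (T element : elements) {
            int key = (bound > 0) ? random.nextInt(bound) : random.nextInt();
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            rows.add(i);
        }
        // Full-range keys: nearly one group per element, so group count
        // scales with row count -- the shuffle has to track every key.
        Map<Integer, List<Integer>> fullRange = groupByRandomKey(rows, new Random(42), 0);
        // Bounded keys: at most 32 groups regardless of row count.
        Map<Integer, List<Integer>> bounded = groupByRandomKey(rows, new Random(42), 32);
        System.out.println("full-range distinct keys: " + fullRange.size());
        System.out.println("bounded distinct keys: " + bounded.size());
    }
}
```

Note the trade-off: a bounded key space shrinks the number of groups the shuffle must manage, but also limits post-shuffle parallelism to at most that many keys. Beam also ships a Reshuffle transform that packages this scatter/gather pattern, which may be preferable to hand-rolling the keying DoFn.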
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)