Posted to commits@beam.apache.org by "Jérémie Vexiau (JIRA)" <ji...@apache.org> on 2017/08/24 09:34:00 UTC
[jira] [Created] (BEAM-2803) JdbcIO read is very slow when query return a lot of rows
Jérémie Vexiau created BEAM-2803:
------------------------------------
Summary: JdbcIO read is very slow when query return a lot of rows
Key: BEAM-2803
URL: https://issues.apache.org/jira/browse/BEAM-2803
Project: Beam
Issue Type: Improvement
Components: sdk-java-extensions
Affects Versions: Not applicable
Reporter: Jérémie Vexiau
Assignee: Reuven Lax
Fix For: Not applicable
Hi,
I'm using the JdbcIO reader in batch mode with the PostgreSQL driver.
My select query returns more than 5 million rows,
fetched through a cursor with Statement.setFetchSize().
These ParDos are OK:
{code:java}
.apply(ParDo.of(new ReadFn<>(this))).setCoder(getCoder())
.apply(ParDo.of(new DoFn<T, KV<Integer, T>>() {
  private Random random;

  @Setup
  public void setup() {
    random = new Random();
  }

  @ProcessElement
  public void processElement(ProcessContext context) {
    context.output(KV.of(random.nextInt(), context.element()));
  }
}))
{code}
But the reshuffle is very, very slow.
It must be the GroupByKey over more than 5 million keys:
{code:java}
.apply(GroupByKey.<Integer, T>create())
{code}
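To illustrate the key-count concern, here is a minimal plain-Java sketch (no Beam involved) of what keying every row with random.nextInt() does to the number of groups. The class RandomKeyGrouping, the method groupByRandomKey and the numKeys parameter are hypothetical names used only for this illustration. The bounded-key variant is an assumption for comparison, not a confirmed fix, and the sketch says nothing about how a given runner actually executes the GroupByKey.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RandomKeyGrouping {

    /**
     * Groups elements under random integer keys.
     * If numKeys <= 0, keys come from random.nextInt() (the full int range),
     * as in the DoFn above. Otherwise keys come from random.nextInt(numKeys),
     * a hypothetical bounded key space used here only for comparison.
     */
    public static <T> Map<Integer, List<T>> groupByRandomKey(
            List<T> elements, int numKeys, Random random) {
        Map<Integer, List<T>> groups = new HashMap<>();
        for (T element : elements) {
            int key = numKeys > 0 ? random.nextInt(numKeys) : random.nextInt();
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Stand-in for the rows returned by the query.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            rows.add(i);
        }
        Random random = new Random(42);
        // Full int range: almost every row ends up in its own group.
        System.out.println("groups with unbounded keys: "
                + groupByRandomKey(rows, 0, random).size());
        // Bounded range: at most 64 groups, each holding many rows.
        System.out.println("groups with 64 keys: "
                + groupByRandomKey(rows, 64, random).size());
    }
}
```

With keys drawn from the full int range, the number of groups is close to the number of rows, which matches the "more than 5 million keys" situation described above.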
Is there a way to optimize the reshuffle, or another method to prevent fusion?
Thanks in advance,
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)