Posted to issues@beam.apache.org by "Braden Bassingthwaite (JIRA)" <ji...@apache.org> on 2018/11/23 01:06:00 UTC

[jira] [Created] (BEAM-6117) Dataflow Slowness

Braden Bassingthwaite created BEAM-6117:
-------------------------------------------

Summary: Dataflow Slowness
Key: BEAM-6117
URL: https://issues.apache.org/jira/browse/BEAM-6117
Project: Beam
Issue Type: Bug
Components: sdk-go
Reporter: Braden Bassingthwaite
Assignee: Robert Burke

This is a pretty open-ended ticket, but we've been struggling with this for quite some time and are hoping to get assistance in resolving our issue.

We wrote and contributed the datastore reader earlier this year and have been using it successfully in our project in a couple of scenarios. The problem we are facing is that our dataflows take a long time. We have Datastore kinds with 100M+ entities, and they take 2-3 days to go over. We've tried fiddling with all of the knobs available to us (datastore splits, CPUs, turning off autoscaling, scope changes, updating libraries, etc.) and can't seem to make it go faster.

My only hunch comes from the datastore reader step. When viewing its status in the Dataflow UI, we see:

Output collections
DailyListingScore/main.queryFn.out0
Elements added
–
Estimated size
–

I am assuming that these numbers indicate to Dataflow how much progress the step is making, and that it scales up or down depending on them. Is this right, or do these numbers have no bearing? We've tried starting the dataflow with 32+ workers, and it always scales down to 1-2 nodes after a couple of minutes. It seems as though Dataflow isn't scaling up when it should. Any direction or assistance in getting this issue solved would be great!

Thanks

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)