Posted to user@beam.apache.org by Mark Striebeck <ma...@gmail.com> on 2022/05/17 20:40:30 UTC

Writing out results to MySQL database

Hi,

We have a data pipeline that produces ~400M datapoints each day. If we run
it without storing the results, it finishes in a little over an hour. If we
run it and store the datapoints in a MySQL database, it takes several hours.

We are running on GCP Dataflow, and the MySQL instances are hosted on GCP.
We are using mysql-beam-connector
<https://github.com/esakik/beam-mysql-connector>.

The pipeline writes ~5000 datapoints per second.

A couple of questions:

   - Does this throughput sound reasonable, or could it be significantly
   improved by optimizing the database?
   - The pipeline runs several workers to write this out - and because it's
   a write operation, they contend for write access. Is it better to write
   out through just one worker and one connection?
   - Is it actually faster to write from the pipeline to Pub/Sub or Kafka or
   such and have a client on the other side which then writes in bulk? (A
   rough sketch of what I mean is below.)
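
For that last option, this is roughly what I picture on the consuming side
(completely untested sketch - the project, subscription, table, and payload
format below are all made up):

    import pymysql
    from google.cloud import pubsub_v1

    PROJECT = "my-project"            # hypothetical
    SUBSCRIPTION = "datapoints-sub"   # hypothetical

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

    conn = pymysql.connect(host="10.0.0.5", user="writer",
                           password="...", database="metrics")

    while True:
        # Synchronously pull a batch of messages.
        response = subscriber.pull(request={"subscription": sub_path,
                                            "max_messages": 1000})
        if not response.received_messages:
            break
        # Assume each payload is a simple "id,value" string.
        rows = [m.message.data.decode().split(",")
                for m in response.received_messages]
        with conn.cursor() as cur:
            # One multi-row INSERT per batch instead of a round trip per row.
            cur.executemany(
                "INSERT INTO datapoints (id, value) VALUES (%s, %s)", rows)
        conn.commit()
        subscriber.acknowledge(request={
            "subscription": sub_path,
            "ack_ids": [m.ack_id for m in response.received_messages]})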

Thanks for any ideas or pointers (no, I'm by no means an experienced DBA!!!)

     Mark

Re: Writing out results to MySQL database

Posted by Christian Battista <ch...@benchsci.com>.
Hi Mark - we faced a similar problem at my company (we sink our data into
BigQuery first from Dataflow and then run a PSQL COPY operation to import it
in bulk).  Moving to bulk writes can help, but the bottleneck is more likely
to be the MySQL instance itself, so you might have to put your DBA hat on.
The I/O itself is not too bad - table constraints and indexes are generally
the main source of slowdowns.  This might not be feasible for you, but we
actually disable constraints/keys before running a batch import and then
reapply them after the insert is complete (rough sketch below).  Re:
Pub/Sub - yes, you could decouple the fate of the Dataflow job from the
MySQL ingest task by writing to Pub/Sub as you suggest, but you're still
going to face the same write-rate limits (though switching to a queue-based
approach has its advantages).
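
To illustrate, the disable/re-enable step looks roughly like this (sketch
only - which statements actually pay off depends on your schema and storage
engine; ALTER TABLE ... DISABLE KEYS only affects non-unique indexes on
MyISAM, so for InnoDB you may end up dropping and re-creating secondary
indexes instead):

    import pymysql

    conn = pymysql.connect(host="10.0.0.5", user="loader",
                           password="...", database="metrics")

    with conn.cursor() as cur:
        # Relax per-row checks for the duration of the bulk load.
        # Note: SET statements are session-scoped, so they only help if the
        # import runs over this same connection (or you set them globally).
        cur.execute("SET unique_checks = 0")
        cur.execute("SET foreign_key_checks = 0")
        cur.execute("ALTER TABLE datapoints DISABLE KEYS")

    # ... run the bulk import here ...

    with conn.cursor() as cur:
        # Reapply everything after the insert is complete.
        cur.execute("ALTER TABLE datapoints ENABLE KEYS")
        cur.execute("SET foreign_key_checks = 1")
        cur.execute("SET unique_checks = 1")
    conn.commit()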

Best,
-C

On Wed, May 18, 2022 at 9:11 AM Yomal de Silva <yo...@gmail.com>
wrote:

> Hi Mark,
> Writing to the DB in bulk would be the first step. Have you looked into
> writing to the DB with a larger batch size? I believe mysql-beam-connector
> also supports this.
>
>
>
> On Wed, May 18, 2022 at 2:13 AM Mark Striebeck <ma...@gmail.com>
> wrote:
>
>> Hi,
>>
>> We have a datapipeline that produces ~400M datapoints each day. If we run
>> it without storing, it finishes in a little over an hour. If we run it and
>> store the datapoints in a MySQL database it takes several hours.
>>
>> We are running on GCP dataflow, the MySQL instances are hosted GCP
>> instances. We are using mysql-beam-connector
>> <https://github.com/esakik/beam-mysql-connector>.
>>
>> The pipeline writes ~5000 datapoints per second.
>>
>> A couple of questions:
>>
>>    - Does this throughput sound reasonable or could it be significantly
>>    improved by optimizing the database?
>>    - The pipeline runs several workers to write this out - and because
>>    it's a write operation they contend for write access. Is it better to write
>>    out through just one worker and one connection?
>>    - Is it actually faster to write from the pipeline to pubsub or kafka
>>    or such and have a client on the other side which then writes in bulk?
>>
>> Thanks for any ideas or pointers (no, I'm by no means an
>> experienced DBA!!!)
>>
>>      Mark
>>
>

-- 
Christian Battista
he/him
Senior Engineer II, Tech Lead
*BenchSci*
*www.benchsci.com <http://www.benchsci.com>*
*E: *christian@benchsci.com

Re: Writing out results to MySQL database

Posted by Yomal de Silva <yo...@gmail.com>.
Hi Mark,
Writing to the DB in bulk would be the first step. Have you looked into
writing to the DB with a larger batch size (something like the sketch below)?
I believe mysql-beam-connector supports this as well.
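
Something along these lines (going from memory of the connector's README, so
please double-check the transform and parameter names against the version you
are using; the host, credentials, and table below are placeholders):

    import apache_beam as beam
    from beam_mysql.connector.io import WriteToMySQL

    with beam.Pipeline() as p:
        (
            p
            | "Produce" >> beam.Create([{"id": 1, "value": 42.0}])  # your real source here
            | "WriteToMySQL" >> WriteToMySQL(
                host="10.0.0.5",
                database="metrics",
                table="datapoints",
                user="writer",
                password="...",
                port=3306,
                batch_size=1000,  # larger batches mean fewer round trips per bundle
            )
        )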



On Wed, May 18, 2022 at 2:13 AM Mark Striebeck <ma...@gmail.com>
wrote:

> Hi,
>
> We have a datapipeline that produces ~400M datapoints each day. If we run
> it without storing, it finishes in a little over an hour. If we run it and
> store the datapoints in a MySQL database it takes several hours.
>
> We are running on GCP dataflow, the MySQL instances are hosted GCP
> instances. We are using mysql-beam-connector
> <https://github.com/esakik/beam-mysql-connector>.
>
> The pipeline writes ~5000 datapoints per second.
>
> A couple of questions:
>
>    - Does this throughput sound reasonable or could it be significantly
>    improved by optimizing the database?
>    - The pipeline runs several workers to write this out - and because
>    it's a write operation they contend for write access. Is it better to write
>    out through just one worker and one connection?
>    - Is it actually faster to write from the pipeline to pubsub or kafka
>    or such and have a client on the other side which then writes in bulk?
>
> Thanks for any ideas or pointers (no, I'm by no means an
> experienced DBA!!!)
>
>      Mark
>