Posted to user@beam.apache.org by Michal Charemza <mi...@charemza.name> on 2023/06/05 07:37:55 UTC

Re: [Question] Ingesting into PostgreSQL - advice/code on setup/teardown tasks?

In case anyone is curious / for posterity: so far I'm doing the ingest into
PostgreSQL entirely outside of Beam, using
https://github.com/uktrade/pg-bulk-ingest. It handles automatic migration of
the table, ingests data in batches where each batch is a single transaction,
and stores a high watermark against the table that's read on subsequent
ingests.
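
In outline, the pattern (one transaction per batch, with a high watermark
committed alongside the rows) looks roughly like the sketch below. This uses
stdlib sqlite3 rather than pg-bulk-ingest's actual API, and the table and
column names are illustrative only:

```python
# A minimal sketch of the batch + high-watermark pattern, NOT the
# pg-bulk-ingest API: each batch of rows is written in a single
# transaction together with a watermark that later ingests read to
# resume from where they left off.
import sqlite3

def ingest_batches(conn, batches):
    """Ingest batches; each batch is (watermark, rows) and commits atomically."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, value TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watermark (table_name TEXT PRIMARY KEY, value TEXT)"
    )
    for watermark, rows in batches:
        with conn:  # one transaction per batch: rows + watermark together
            conn.executemany("INSERT INTO events (id, value) VALUES (?, ?)", rows)
            conn.execute(
                "INSERT INTO watermark (table_name, value) VALUES ('events', ?) "
                "ON CONFLICT (table_name) DO UPDATE SET value = excluded.value",
                (watermark,),
            )

def read_watermark(conn):
    """Read the stored watermark so a subsequent ingest can resume from it."""
    row = conn.execute(
        "SELECT value FROM watermark WHERE table_name = 'events'"
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
ingest_batches(conn, [("2023-06-01", [(1, "a"), (2, "b")]), ("2023-06-02", [(3, "c")])])
print(read_watermark(conn))  # prints 2023-06-02
```

Because the watermark is committed in the same transaction as the rows, a
crash mid-ingest leaves the table and watermark consistent with each other.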

Whether this can somehow be integrated with Beam, I'm not too sure.
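
The auto-migration idea from the thread below - declare the table you want,
and migrate any existing table toward it where it can - could be sketched
something like this. Again this is stdlib sqlite3 for illustration (a real
version would take a SQLAlchemy Table and emit PostgreSQL DDL), and all
names are hypothetical:

```python
# A hedged sketch of "migrate the existing table toward a declared
# definition, where it can": only additive, conflict-free changes
# (new nullable columns) are applied automatically; anything else is
# reported rather than attempted. Uses stdlib sqlite3 for illustration.
import sqlite3

def migrate_towards(conn, table, desired_columns):
    """desired_columns: mapping of column name -> SQL type."""
    existing = {
        row[1]: row[2]  # PRAGMA table_info rows: (cid, name, type, notnull, dflt, pk)
        for row in conn.execute(f"PRAGMA table_info({table})")
    }
    if not existing:
        cols = ", ".join(f"{name} {typ}" for name, typ in desired_columns.items())
        conn.execute(f"CREATE TABLE {table} ({cols})")
        return ["created"]
    applied = []
    for name, typ in desired_columns.items():
        if name not in existing:
            # Adding a nullable column is the classic conflict-free DDL
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {typ}")
            applied.append(f"added {name}")
    dropped = set(existing) - set(desired_columns)
    if dropped:
        # Destructive changes are flagged, not applied automatically
        applied.append(f"manual attention needed: {sorted(dropped)}")
    return applied

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
print(migrate_towards(conn, "items", {"id": "INTEGER", "added_at": "TEXT"}))  # prints ['added added_at']
```

The design choice here is the conservative split: additive changes are safe
to automate, while anything destructive (dropped or retyped columns) is
surfaced for a human to handle.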

On Sat, May 6, 2023 at 4:09 PM Michal Charemza <mi...@charemza.name> wrote:

> Ah thanks for your reply!
>
> So done manually/outside - I was hoping to have this in the pipeline,
> really, so that DDLs would be done in a transaction along with the data
> ingest. And even if not quite that, it would be great to have all the
> stages/code defined and run by the same thing, rather than have something
> running outside the pipeline.
>
> The automatic migration thing... I will have a nose around. That's close
> to what I'm looking for, although maybe not quite. I think what I would
> like is some sort of sink transform that accepts a SQLAlchemy table
> definition as a parameter, and automatically migrates any existing table
> to match it (where it can). Although I don't know Beam well enough yet to
> know if that's really what I want...
>
> (Note also - the word "application" here... there isn't really an
> application here - I'm using PostgreSQL as, I guess, a data warehouse)
>
> Michal
>
> On Sat, May 6, 2023 at 3:18 PM Pavel Solomin <p....@gmail.com>
> wrote:
>
>> Hello!
>>
>> Usually, DDLs (create table / alter table) live outside of the
>> application. In my experience, that sort of task is done either manually
>> or via automation tools like Liquibase / Flyway. This is not specific to
>> Beam; it is a common pattern in backend / data engineering application
>> development.
>>
>> Some applications support the simplest, conflict-free DDLs, like adding
>> a nullable column, without restarting the app itself.
>>
>> I remember seeing some examples of Beam apps in Java and Python that
>> supported automatic schema migrations. Example:
>>
>> https://medium.com/inside-league/streaming-data-to-bigquery-with-dataflow-and-real-time-schema-updating-c7a3deba3bad
>>
>> I am not aware of automatic solutions for arbitrary schema changes though.
>>
>> On Saturday, 6 May 2023, Michal Charemza <mi...@charemza.name> wrote:
>> > I'm looking into using Beam to ingest from various sources into a
>> PostgreSQL database, but there is something I don't quite know how to fit
>> into the Beam model: how should I deal with "non-data" tasks that need to
>> happen before or after the pipeline proper?
>> > For example, creation of tables, renames of tables, and migrations on
>> existing tables. Where should this sort of code/logic live if the
>> fetch/ingestion of data is via Beam? Or is this entirely outside the Beam
>> model - it should happen before or after the pipeline, but not as part of
>> the pipeline?
>>
>> --
>> Best Regards,
>> Pavel Solomin
>>
>> Tel: +351 962 950 692 | Skype: pavel_solomin | Linkedin
>> <https://www.linkedin.com/in/pavelsolomin>
>>