Posted to dev@beam.apache.org by Vasu Gupta <de...@gmail.com> on 2020/02/29 16:52:03 UTC

JdbcIO for writing to Dynamic Schemas in Postgres

Hey folks,

Can we use JdbcIO for writing data to multiple schemas (for a Postgres database) dynamically using the Apache Beam Java framework? Currently, I can't find any property that I could set on the JdbcIO transform to provide a schema, so maybe I am missing something.

Thanks

Re: JdbcIO for writing to Dynamic Schemas in Postgres

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Did you create a Jira about that already?

I will do the improvement on JdbcIO.

Regards
JB

Le dim. 31 mai 2020 à 11:25, Willem Pienaar <me...@willem.co> a écrit :

> Hi Reuven,
>
> To be clear, we already have this solved for BigQueryIO. I am hoping there
> is a similar solution for JdbcIO.
>
> Regards,
> Willem
>
> On Sun, May 31, 2020, at 12:42 PM, Reuven Lax wrote:
>
>> This should be possible using the Beam programmatic API. You can pass
>> BigQueryIO a function that determines the BigQuery table based on the
>> input element.
>>
>> On Sat, May 30, 2020 at 9:20 PM Willem Pienaar <me...@willem.co> wrote:
>>
>>> Hi JB,
>>>
>>> Apologies for resurrecting this thread, but I have a related question.
>>>
>>> We've built a feature store, Feast (https://github.com/feast-dev/feast),
>>> primarily on Beam. We have been very happy with our decision to use Beam
>>> thus far. Beam is mostly used as the ingestion layer that writes data
>>> into stores (BigQuery, Redis). I am currently implementing JdbcIO (for
>>> PostgreSQL) and it's working fine so far. I set up all the tables when
>>> the job is launched, and I write into different tables depending on the
>>> input elements.
>>>
>>> However, a problem we are facing is that schema changes are happening
>>> very rapidly based on our users' activity. Every time the user changes a
>>> collection of features/fields, we have to launch a new Dataflow job in
>>> order to support the new database schema. This can take 3-4 minutes.
>>> Every time the jobs are in an updating state we have to block all user
>>> activity, which is quite disruptive.
>>>
>>> What we want to do is dynamically configure the SQL insert statement
>>> based on the input elements. This would allow us to keep the same job
>>> running indefinitely, dramatically improving the user experience. We
>>> have found solutions for BigQueryIO and our other IO, but not yet for
>>> JdbcIO. As far as I can tell it isn't possible to modify the SQL insert
>>> statement to write to a new table or to the same table with new columns,
>>> without restarting the job.
>>>
>>> Do you have any suggestions on how we can achieve the above? If it can't
>>> be done with the current implementation, would it be reasonable to
>>> contribute this functionality back to Beam?
>>>
>>> Regards,
>>> Willem
>>>
>>> On Tue, Mar 3, 2020, at 1:30 AM, Jean-Baptiste Onofre wrote:
>>> > Hi
>>> >
>>> > You have the setPrepareStatement() method where you define the target
>>> > tables. However, it's in the same database (datasource) per pipeline.
>>> >
>>> > You can define several datasources and use a different datasource in
>>> > each JdbcIO write. Meaning that you can divide the pipeline into
>>> > sub-pipelines.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > > Le 29 févr. 2020 à 17:52, Vasu Gupta <de...@gmail.com> a écrit :
>>> > >
>>> > > Hey folks,
>>> > >
>>> > > Can we use JdbcIO for writing data to multiple schemas (for a
>>> > > Postgres database) dynamically using the Apache Beam Java framework?
>>> > > Currently, I can't find any property that I could set on the JdbcIO
>>> > > transform to provide a schema, so maybe I am missing something.
>>> > >
>>> > > Thanks

Re: JdbcIO for writing to Dynamic Schemas in Postgres

Posted by Reuven Lax <re...@google.com>.
It doesn't look to me like JdbcIO currently supports dynamic destinations.
I think it wouldn't be too hard to add this functionality. If you wanted to
help contribute this change to JdbcIO.java, I'm sure that we would be happy
to help guide you.
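
To make the idea concrete, here is a rough sketch of the shape such an API
could take. This is purely hypothetical: withStatementFn() does not exist in
JdbcIO today, and FeatureRow stands in for a user-defined element type.

    // Hypothetical sketch only -- withStatementFn() is not a real JdbcIO
    // method. The idea mirrors BigQueryIO's dynamic destinations: a
    // user-supplied function derives the insert statement from each element.
    elements.apply(JdbcIO.<FeatureRow>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://localhost:5432/feast"))
        .withStatementFn(row ->  // hypothetical: statement chosen per element
            "insert into " + row.getTableName() + " (key, value) values (?, ?)")
        .withPreparedStatementSetter((row, stmt) -> {
          stmt.setString(1, row.getKey());
          stmt.setString(2, row.getValue());
        }));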

Reuven

On Sun, May 31, 2020 at 2:26 AM Willem Pienaar <me...@willem.co> wrote:

> Hi Reuven,
>
> To be clear, we already have this solved for BigQueryIO. I am hoping there
> is a similar solution for JdbcIO.
>
> Regards,
> Willem
>
> On Sun, May 31, 2020, at 12:42 PM, Reuven Lax wrote:
>
> This should be possible using the Beam programmatic API. You can pass
> BigQueryIO a function that determines the BigQuery table based on the input
> element.
>
> On Sat, May 30, 2020 at 9:20 PM Willem Pienaar <me...@willem.co> wrote:
>
> Hi JB,
>
> Apologies for resurrecting this thread, but I have a related question.
>
> We've built a feature store, Feast (https://github.com/feast-dev/feast),
> primarily on Beam. We have been very happy with our decision to use Beam
> thus far. Beam is mostly used as the ingestion layer that writes data into
> stores (BigQuery, Redis). I am currently implementing JdbcIO (for
> PostgreSQL) and it's working fine so far. I set up all the tables when the
> job is launched, and I write into different tables depending on the input
> elements.
>
> However, a problem we are facing is that schema changes are happening very
> rapidly based on our users' activity. Every time the user changes a
> collection of features/fields, we have to launch a new Dataflow job in
> order to support the new database schema. This can take 3-4 minutes. Every
> time the jobs are in an updating state we have to block all user activity,
> which is quite disruptive.
>
> What we want to do is dynamically configure the SQL insert statement based
> on the input elements. This would allow us to keep the same job running
> indefinitely, dramatically improving the user experience. We have found
> solutions for BigQueryIO and our other IO, but not yet for JdbcIO. As far
> as I can tell it isn't possible to modify the SQL insert statement to write
> to a new table or to the same table with new columns, without restarting
> the job.
>
> Do you have any suggestions on how we can achieve the above? If it can't
> be done with the current implementation, would it be reasonable to
> contribute this functionality back to Beam?
>
> Regards,
> Willem
>
> On Tue, Mar 3, 2020, at 1:30 AM, Jean-Baptiste Onofre wrote:
> > Hi
> >
> > You have the setPrepareStatement() method where you define the target
> tables.
> > However, it’s in the same database (datasource) per pipeline.
> >
> > You can define several datasources and use a different datasource in
> > each JdbcIO write. Meaning that you can divide the pipeline into sub-pipelines.
> >
> > Regards
> > JB
> >
> > > Le 29 févr. 2020 à 17:52, Vasu Gupta <de...@gmail.com> a
> écrit :
> > >
> > > Hey folks,
> > >
> > > Can we use JdbcIO for writing data to multiple schemas (for a Postgres
> database) dynamically using the Apache Beam Java framework? Currently, I
> can't find any property that I could set on the JdbcIO transform to provide
> a schema, so maybe I am missing something.
> > >
> > > Thanks
> >
> >
>
>
>

Re: JdbcIO for writing to Dynamic Schemas in Postgres

Posted by Willem Pienaar <me...@willem.co>.
Hi Reuven,

To be clear, we already have this solved for BigQueryIO. I am hoping there is a similar solution for JdbcIO.

Regards,
Willem

On Sun, May 31, 2020, at 12:42 PM, Reuven Lax wrote:
> This should be possible using the Beam programmatic API. You can pass BigQueryIO a function that determines the BigQuery table based on the input element.
> 
> On Sat, May 30, 2020 at 9:20 PM Willem Pienaar <me...@willem.co> wrote:
>> Hi JB,
>> 
>>  Apologies for resurrecting this thread, but I have a related question.
>> 
>>  We've built a feature store, Feast (https://github.com/feast-dev/feast), primarily on Beam. We have been very happy with our decision to use Beam thus far. Beam is mostly used as the ingestion layer that writes data into stores (BigQuery, Redis). I am currently implementing JdbcIO (for PostgreSQL) and it's working fine so far. I set up all the tables when the job is launched, and I write into different tables depending on the input elements.
>> 
>>  However, a problem we are facing is that schema changes are happening very rapidly based on our users' activity. Every time the user changes a collection of features/fields, we have to launch a new Dataflow job in order to support the new database schema. This can take 3-4 minutes. Every time the jobs are in an updating state we have to block all user activity, which is quite disruptive.
>> 
>>  What we want to do is dynamically configure the SQL insert statement based on the input elements. This would allow us to keep the same job running indefinitely, dramatically improving the user experience. We have found solutions for BigQueryIO and our other IO, but not yet for JdbcIO. As far as I can tell it isn't possible to modify the SQL insert statement to write to a new table or to the same table with new columns, without restarting the job.
>> 
>>  Do you have any suggestions on how we can achieve the above? If it can't be done with the current implementation, would it be reasonable to contribute this functionality back to Beam?
>> 
>>  Regards,
>>  Willem
>> 
>>  On Tue, Mar 3, 2020, at 1:30 AM, Jean-Baptiste Onofre wrote:
>>  > Hi
>>  > 
>>  > You have the setPrepareStatement() method where you define the target tables.
>>  > However, it’s in the same database (datasource) per pipeline.
>>  > 
>>  > You can define several datasources and use a different datasource in 
>>  > each JdbcIO write. Meaning that you can divide the pipeline into sub-pipelines.
>>  > 
>>  > Regards
>>  > JB
>>  > 
>>  > > Le 29 févr. 2020 à 17:52, Vasu Gupta <de...@gmail.com> a écrit :
>>  > > 
>>  > > Hey folks,
>>  > > 
>>  > > Can we use JdbcIO for writing data to multiple schemas (for a Postgres database) dynamically using the Apache Beam Java framework? Currently, I can't find any property that I could set on the JdbcIO transform to provide a schema, so maybe I am missing something.
>>  > > 
>>  > > Thanks
>>  > 
>>  >

Re: JdbcIO for writing to Dynamic Schemas in Postgres

Posted by Reuven Lax <re...@google.com>.
This should be possible using the Beam programmatic API. You can pass
BigQueryIO a function that determines the BigQuery table based on the input
element.
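
For example, something along these lines (a minimal sketch: the table naming
scheme, the "kind" field, and the schema variable are placeholders, and rows
is assumed to be a PCollection<TableRow>):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    // Route each TableRow to a table derived from the element itself.
    // The project, dataset, and "kind" field below are placeholders.
    rows.apply(BigQueryIO.writeTableRows()
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>)
            input -> new TableDestination(
                "my-project:my_dataset.events_" + input.getValue().get("kind"),
                null))
        .withSchema(schema)  // TableSchema assumed to be defined elsewhere
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));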

On Sat, May 30, 2020 at 9:20 PM Willem Pienaar <me...@willem.co> wrote:

> Hi JB,
>
> Apologies for resurrecting this thread, but I have a related question.
>
> We've built a feature store, Feast (https://github.com/feast-dev/feast),
> primarily on Beam. We have been very happy with our decision to use Beam
> thus far. Beam is mostly used as the ingestion layer that writes data into
> stores (BigQuery, Redis). I am currently implementing JdbcIO (for
> PostgreSQL) and it's working fine so far. I set up all the tables when the
> job is launched, and I write into different tables depending on the input
> elements.
>
> However, a problem we are facing is that schema changes are happening very
> rapidly based on our users' activity. Every time the user changes a
> collection of features/fields, we have to launch a new Dataflow job in
> order to support the new database schema. This can take 3-4 minutes. Every
> time the jobs are in an updating state we have to block all user activity,
> which is quite disruptive.
>
> What we want to do is dynamically configure the SQL insert statement based
> on the input elements. This would allow us to keep the same job running
> indefinitely, dramatically improving the user experience. We have found
> solutions for BigQueryIO and our other IO, but not yet for JdbcIO. As far
> as I can tell it isn't possible to modify the SQL insert statement to write
> to a new table or to the same table with new columns, without restarting
> the job.
>
> Do you have any suggestions on how we can achieve the above? If it can't
> be done with the current implementation, would it be reasonable to
> contribute this functionality back to Beam?
>
> Regards,
> Willem
>
> On Tue, Mar 3, 2020, at 1:30 AM, Jean-Baptiste Onofre wrote:
> > Hi
> >
> > You have the setPrepareStatement() method where you define the target
> tables.
> > However, it’s in the same database (datasource) per pipeline.
> >
> > You can define several datasources and use a different datasource in
> > each JdbcIO write. Meaning that you can divide the pipeline into sub-pipelines.
> >
> > Regards
> > JB
> >
> > > Le 29 févr. 2020 à 17:52, Vasu Gupta <de...@gmail.com> a
> écrit :
> > >
> > > Hey folks,
> > >
> > > Can we use JdbcIO for writing data to multiple schemas (for a Postgres
> database) dynamically using the Apache Beam Java framework? Currently, I
> can't find any property that I could set on the JdbcIO transform to provide
> a schema, so maybe I am missing something.
> > >
> > > Thanks
> >
> >
>

Re: JdbcIO for writing to Dynamic Schemas in Postgres

Posted by Willem Pienaar <me...@willem.co>.
Hi JB,

Apologies for resurrecting this thread, but I have a related question.

We've built a feature store, Feast (https://github.com/feast-dev/feast), primarily on Beam. We have been very happy with our decision to use Beam thus far. Beam is mostly used as the ingestion layer that writes data into stores (BigQuery, Redis). I am currently implementing JdbcIO (for PostgreSQL) and it's working fine so far. I set up all the tables when the job is launched, and I write into different tables depending on the input elements.

However, a problem we are facing is that schema changes are happening very rapidly based on our users' activity. Every time the user changes a collection of features/fields, we have to launch a new Dataflow job in order to support the new database schema. This can take 3-4 minutes. Every time the jobs are in an updating state we have to block all user activity, which is quite disruptive.

What we want to do is dynamically configure the SQL insert statement based on the input elements. This would allow us to keep the same job running indefinitely, dramatically improving the user experience. We have found solutions for BigQueryIO and our other IO, but not yet for JdbcIO. As far as I can tell it isn't possible to modify the SQL insert statement to write to a new table or to the same table with new columns, without restarting the job.
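
As a stopgap, we have sketched a plain DoFn that manages its own JDBC
connection and assembles the insert statement per element, bypassing JdbcIO
entirely. FeatureRow and its accessors below are placeholders for our own
types, and the table/column names are assumed to be validated upstream, since
they are concatenated into the SQL string:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.stream.Collectors;
    import org.apache.beam.sdk.transforms.DoFn;

    // Sketch: a DoFn that builds the insert statement from each element,
    // so new tables/columns need no job restart. FeatureRow is a placeholder.
    class DynamicJdbcWriteFn extends DoFn<FeatureRow, Void> {
      private transient Connection connection;

      @Setup
      public void setup() throws SQLException {
        // Connection string/credentials are placeholders.
        connection = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/feast", "user", "password");
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws SQLException {
        FeatureRow row = c.element();
        String placeholders = row.getColumnNames().stream()
            .map(col -> "?").collect(Collectors.joining(", "));
        // Table/column names come from the element; values stay parameterized.
        String sql = "insert into " + row.getTableName()
            + " (" + String.join(", ", row.getColumnNames())
            + ") values (" + placeholders + ")";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
          for (int i = 0; i < row.getValues().size(); i++) {
            stmt.setObject(i + 1, row.getValues().get(i));
          }
          stmt.executeUpdate();
        }
      }

      @Teardown
      public void teardown() throws SQLException {
        if (connection != null) {
          connection.close();
        }
      }
    }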

Do you have any suggestions on how we can achieve the above? If it can't be done with the current implementation, would it be reasonable to contribute this functionality back to Beam?

Regards,
Willem

On Tue, Mar 3, 2020, at 1:30 AM, Jean-Baptiste Onofre wrote:
> Hi
> 
> You have the setPrepareStatement() method where you define the target tables.
> However, it’s in the same database (datasource) per pipeline.
> 
> You can define several datasources and use a different datasource in 
> each JdbcIO write. Meaning that you can divide the pipeline into sub-pipelines.
> 
> Regards
> JB
> 
> > Le 29 févr. 2020 à 17:52, Vasu Gupta <de...@gmail.com> a écrit :
> > 
> > Hey folks,
> > 
> > Can we use JdbcIO for writing data to multiple schemas (for a Postgres database) dynamically using the Apache Beam Java framework? Currently, I can't find any property that I could set on the JdbcIO transform to provide a schema, so maybe I am missing something.
> > 
> > Thanks
> 
>

Re: JdbcIO for writing to Dynamic Schemas in Postgres

Posted by Jean-Baptiste Onofre <jb...@nanthrax.net>.
Hi

You have the setPrepareStatement() method where you define the target tables.
However, it’s in the same database (datasource) per pipeline.

You can define several datasources and use a different datasource in each JdbcIO write. Meaning that you can divide the pipeline into sub-pipelines.
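
For example (a sketch: FeatureRow is a placeholder element type,
rowsForSchemaA is assumed to come from an upstream split such as Partition or
a multi-output ParDo, and note that in the Beam releases I have checked the
builder method for the insert statement is withStatement()):

    // One JdbcIO.write() per known target table/schema, each with its own
    // statement and, if needed, its own datasource.
    rowsForSchemaA.apply("WriteSchemaA", JdbcIO.<FeatureRow>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://host-a:5432/db"))
        .withStatement("insert into schema_a.features (key, value) values (?, ?)")
        .withPreparedStatementSetter((row, stmt) -> {
          stmt.setString(1, row.getKey());
          stmt.setString(2, row.getValue());
        }));
    // Repeat with a different datasource/statement for schema_b, and so on.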

Regards
JB

> Le 29 févr. 2020 à 17:52, Vasu Gupta <de...@gmail.com> a écrit :
> 
> Hey folks,
> 
> Can we use JdbcIO for writing data to multiple schemas (for a Postgres database) dynamically using the Apache Beam Java framework? Currently, I can't find any property that I could set on the JdbcIO transform to provide a schema, so maybe I am missing something.
> 
> Thanks