Posted to dev@beam.apache.org by Eugene Kirpichov <ki...@google.com.INVALID> on 2017/09/05 01:59:58 UTC

Re: writing status

Hi,

Sorry for the delay. So it sounds like you want to do something after
writing a window of data to BigQuery is complete.
I think this should be possible: expansion of BigQueryIO.write() returns a
WriteResult and you can apply other transforms to it. Have you tried that?
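
For concreteness, the suggestion above could be sketched like this (a sketch only, not tested against a specific Beam release; `rows`, `schema`, and the table name are assumed placeholders):

```java
// Sketch: BigQueryIO.write() expands to a WriteResult, and further
// transforms can be chained onto the collections it exposes.
WriteResult result = rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")   // placeholder table
        .withSchema(schema));                   // placeholder schema

// With the streaming-insert path, rows BigQuery rejected come back
// here and can feed a status/error-handling step.
result.getFailedInserts()
    .apply("RecordFailures", ParDo.of(new DoFn<TableRow, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // e.g. log or route c.element() to an error table
      }
    }));
```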

On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:

> I have documents from a MongoDB that I need to migrate to BigQuery.
> Since it is MongoDB, I do not know the schema ahead of time, so I have
> two pipelines: one to run over the documents and update the BigQuery
> schema, then wait a few minutes (it can take a while for BigQuery to be
> able to use the new schema), then with the other pipeline copy all the
> documents.
> To know where I got to with the different pipelines, I have a status
> table, so that at the start I know from where to continue.
> So I need the option to update the status table with the success of
> the copy and some time value of the last copied document.
>
>
> chaim
>
> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
> > I'd like to know more about both of your use cases; can you clarify? I think
> > making sinks output something that can be waited on by another pipeline
> > step is a reasonable request, but more details would help refine this
> > suggestion.
> >
> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <ch...@apache.org>
> > wrote:
> >
> >> Can you do this from the program that runs the Beam job, after the
> >> job is complete (you might have to use a blocking runner or poll for
> >> the status of the job)?
> >>
> >> - Cham
> >>
> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sn...@apache.org>
> wrote:
> >>
> >> > I also have a similar use case (but with BigTable) that I feel like I
> >> > had to hack up to make work. It'd be great to hear if there is a way
> >> > to do something like this already, or if there are plans in the future.
> >> >
> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
> wrote:
> >> >
> >> > > Hi,
> >> > >   I have a few pipelines that are an ETL from different systems to
> >> > > BigQuery.
> >> > > I would like to write the status of the ETL after all records have
> >> > > been updated in BigQuery.
> >> > > The problem is that writing to BigQuery is a sink, and you cannot have
> >> > > any other steps after the sink.
> >> > > I tried a side output, but this is called with no correlation to the
> >> > > writing to BigQuery, so I don't know if it succeeded or failed.
> >> > >
> >> > >
> >> > > any ideas?
> >> > > chaim
> >> > >
> >> >
> >>
>

Re: writing status

Posted by Chaim Turkel <ch...@behalf.com>.
Also, I think getFailedInserts does not work. I expected the write to
succeed and getFailedInserts to return the records that did not insert,
but instead the whole batch failed.

chaim

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
<ki...@google.com.invalid> wrote:
> Oh I see! Okay, this should be easy to fix. I'll take a look.
>
> On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> WriteResult does not support apply -> that is the problem

Re: writing status

Posted by Steve Niemitz <sn...@apache.org>.
I like the idea of having sinks return PCollection<Void>, which is very
similar to how I changed the BigtableIO.Write interface.  I can submit a
pull request for BigtableIO which does that once I test it out a little
more.
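
The shape described above could be sketched roughly as follows (a hedged sketch, not the actual BigtableIO change; types and names are simplified placeholders):

```java
// Sketch: a sink transform whose expansion returns a PCollection<Void>,
// so downstream steps have something to depend on.
class WriteToSink extends PTransform<PCollection<String>, PCollection<Void>> {
  @Override
  public PCollection<Void> expand(PCollection<String> input) {
    return input.apply("DoWrite", ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // ... perform the actual write of c.element() ...
        // no output per element; the (possibly empty) Void collection
        // signals completion through its watermark and panes
      }
    }));
  }
}
```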

On Sat, Sep 9, 2017 at 1:46 PM, Eugene Kirpichov <
kirpichov@google.com.invalid> wrote:

> Hi Steve,
> Unfortunately for BigQuery it's more complicated than that. Rows aren't
> written to BigQuery one by one (unless you're using streaming inserts,
> which are way more expensive and are usually used only in streaming
> pipelines) - they are written to files, and then a BigQuery import job, or
> several import jobs if there are too many files, picks them up. We can
> declare writing complete when all of the BigQuery import jobs have
> successfully completed.
> However, the method of writing is an implementation detail of BigQuery, so
> we need to create an API that works regardless of the method (import jobs
> vs. streaming inserts).
> Another complication is triggering - windows can fire multiple times. This
> rules out any approaches that sequence using side inputs, because side
> inputs don't have triggering.
>
> I think a common approach could be to return a PCollection<Void>,
> containing a Void in every window and pane that has been successfully
> written. This could be implemented in both modes and could be a general
> design pattern for this sort of thing. It just isn't easy to implement, so
> I didn't have time to take it on. It also could turn out to have other
> complications we haven't thought of yet.
>
> That said, if somebody tried to implement this for some connectors (not
> necessarily BigQuery) and pioneered the approach, it would be a great
> contribution.
>
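
A later Beam release did add a transform along exactly these lines (Wait.on). As a hedged sketch of how such a signal could be consumed, assuming a hypothetical sink that exposes a signal collection:

```java
// Sketch: hold back a downstream step until the write signal has fired
// for the corresponding window. Wait.on landed in Beam after this
// thread; WriteWithSignal and WriteStatusToTable are hypothetical.
PCollection<Void> signal = input.apply("Write", new WriteWithSignal());
statusRows
    .apply("WaitForWrite", Wait.on(signal))
    .apply("UpdateStatusTable", new WriteStatusToTable());
```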
> On Sat, Sep 9, 2017 at 9:41 AM Steve Niemitz <sn...@apache.org> wrote:
>
> > I wonder if it makes sense to start simple and go from there. For
> > example, I enhanced BigtableIO.Write to output the number of rows written
> > in finishBundle(), simply into the global window with the current
> > timestamp. This was more than enough to unblock me, but doesn't support
> > more complicated scenarios with windowing.
> >
> > However, as I said it was more than enough to solve the general batch use
> > case, and I imagine it could be enhanced to support windowing by keeping
> > track of which windows were written per bundle. (Can there even ever be
> > more than one window per bundle?)
> >
> > On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
> > kirpichov@google.com.invalid> wrote:
> >
> > > Hi,
> > > I was going to implement this, but discussed it with +Reuven Lax
> > > <re...@google.com> and it appears to be quite difficult to do properly,
> > > or even to define what it means at all, especially if you're using the
> > > streaming inserts write method. So for now there is no workaround except
> > > programmatically waiting for your whole pipeline to finish
> > > (pipeline.run().waitUntilFinish()).
> > >
> > > On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:
> > >
> > > > Is there a way around this for now?
> > > > How can I get a snapshot version?
> > > >
> > > > chaim
> > > >
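
Spelled out, the workaround mentioned above (blocking on the whole pipeline, then updating the status table and launching the next stage) might look like this sketch; `updateStatusTable` and `lastCopiedTimestamp` are placeholders for user code:

```java
// Sketch: run the schema/copy pipeline to completion, then act on the
// final state before starting the next pipeline.
PipelineResult result = pipeline.run();
PipelineResult.State state = result.waitUntilFinish();
if (state == PipelineResult.State.DONE) {
  updateStatusTable(lastCopiedTimestamp);  // placeholder: user-defined
  // ...then construct and run the second (document-copy) pipeline
}
```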

Re: writing status

Posted by Chaim Turkel <ch...@behalf.com>.
So how does the getFailedInserts method work? (Though from what I saw,
it does not work.)
chaim

On Sat, Sep 9, 2017 at 9:49 PM, Reuven Lax <re...@google.com.invalid> wrote:
> I'm still not sure how this would work (or even make sense) for the
> streaming-write path.
>
> Also in both paths, the actual write to BigQuery is unwindowed.
>
> On Sat, Sep 9, 2017 at 11:44 AM, Eugene Kirpichov <ki...@google.com>
> wrote:
>
>> There'd be 1 Void per pane per window, so I could extract information
>> about whether this is the first pane, last pane, or something else - there
>> are probably use cases for each of these.
>>
>> On Sat, Sep 9, 2017 at 11:37 AM Reuven Lax <re...@google.com> wrote:
>>
>>> How would you know how many Voids to wait for downstream?

Re: writing status

Posted by Reuven Lax <re...@google.com.INVALID>.
I'm still not sure how this would work (or even make sense) for the
streaming-write path.

Also in both paths, the actual write to BigQuery is unwindowed.


Re: writing status

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
There'd be 1 Void per pane per window, so I could extract information about
whether this is the first pane, last pane, or something else - there are
probably use cases for each of these.
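
Inside a DoFn, pane information is available from the process context; a sketch (what to do on first or last panes is up to the pipeline):

```java
// Sketch: one Void per window/pane lets a consumer distinguish first,
// last, and intermediate firings via PaneInfo.
ParDo.of(new DoFn<Void, Void>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    PaneInfo pane = c.pane();
    if (pane.isFirst()) {
      // first firing for this window: e.g. initialize the status row
    }
    if (pane.isLast()) {
      // final firing: safe to mark this window's write as complete
    }
  }
});
```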

On Sat, Sep 9, 2017 at 11:37 AM Reuven Lax <re...@google.com> wrote:

> How would you know how many Voids to wait for downstream?

Re: writing status

Posted by Reuven Lax <re...@google.com.INVALID>.
How would you know how many Voids to wait for downstream?


Re: writing status

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hi Steve,
Unfortunately for BigQuery it's more complicated than that. Rows aren't
written to BigQuery one by one (unless you're using streaming inserts,
which are way more expensive and are usually used only in streaming
pipelines) - they are written to files, and then a BigQuery import job, or
several import jobs if there are too many files, picks them up. We can
declare writing complete when all of the BigQuery import jobs have
successfully completed.
However, the method of writing is an implementation detail of BigQuery, so
we need to create an API that works regardless of the method (import jobs
vs. streaming inserts).
Another complication is triggering - windows can fire multiple times. This
rules out any approaches that sequence using side inputs, because side
inputs don't have triggering.

I think a common approach could be to return a PCollection<Void>,
containing a Void in every window and pane that has been successfully
written. This could be implemented in both modes and could be a general
design pattern for this sort of thing. It just isn't easy to implement, so
I didn't have time to take it on. It also could turn out to have other
complications we haven't thought of yet.

That said, if somebody tried to implement this for some connectors (not
necessarily BigQuery) and pioneered the approach, it would be a great
contribution.
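
If the proposed signal existed, downstream use might look like this sketch; the PCollection<Void> parameter stands in for the hypothetical sink output, and nothing here is an existing sink API:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

class StatusUpdate {
  // 'signals' stands in for the proposed PCollection<Void> carrying one
  // element per successfully written window/pane (hypothetical output).
  static PCollection<Void> afterWrite(PCollection<Void> signals) {
    return signals.apply("UpdateStatusTable", ParDo.of(new DoFn<Void, Void>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Runs once per completed pane; update the external status
        // table (e.g. last-copied timestamp) here.
      }
    }));
  }
}
```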

On Sat, Sep 9, 2017 at 9:41 AM Steve Niemitz <sn...@apache.org> wrote:

> I wonder if it makes sense to start simple and go from there.  For example,
> I enhanced BigtableIO.Write to output the number of rows written
> in finishBundle(), simply into the global window with the current
> timestamp.  This was more than enough to unblock me, but doesn't support
> more complicated scenarios with windowing.
>
> However, as I said it was more than enough to solve the general batch use
> case, and I imagine could be enhanced to support windowing by keeping track
> of which windows were written per bundle. (can there even ever be more than
> one window per bundle?)
>
> On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
> kirpichov@google.com.invalid> wrote:
>
> > Hi,
> > I was going to implement this, but discussed it with +Reuven Lax
> > <re...@google.com> and it appears to be quite difficult to do properly,
> or
> > even to define what it means at all, especially if you're using the
> > streaming inserts write method. So for now there is no workaround except
> > programmatically waiting for your whole pipeline to finish
> > (pipeline.run().waitUntilFinish()).
> >
> > On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:
> >
> > > is there a way around this for now?
> > > how can i get a snapshot version?
> > >
> > > chaim
> > >
> > > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
> > > <ki...@google.com.invalid> wrote:
> > > > Oh I see! Okay, this should be easy to fix. I'll take a look.
> > > >
> > > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com>
> wrote:
> > > >
> > > >> WriteResult does not support apply -> that is the problem

Re: writing status

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
I'd like to elaborate on why getFailedInserts is easier to implement than
returning a sequenceable indicator of success.

getFailedInserts returns individual rows. In a streaming pipeline, it will
return them one by one, as each individual row fails [in a batch pipeline -
or in a streaming pipeline but with BigQueryIO.write() configured to use
the BATCH_LOADS write method - as Reuven said, it will be empty].

This is reasonable because we assume that the number of failed rows will be
small, and because the exact contents of the failed rows are valuable: we
want to preserve them for further investigation to avoid data loss.

Returning individual "succeeded" rows would be inefficient and rather odd,
because it would in most cases simply duplicate the input PCollection
(which can also be very large, so returning it would make the pipeline much
slower), and in all cases the caller would ignore the contents of the rows.

I still believe the right approach is PCollection<Void> with windows and
panes as suggested above. It's not easy to implement, especially with
streaming inserts, but I think it gives the right
batch/streaming-independent semantics and should be doable. Will keep
thinking.
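
For the streaming path, the failed-inserts hook already exists; roughly, with API names as of Beam 2.x around this thread, and a placeholder table spec for a table assumed to already exist:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

class FailedInsertsExample {
  // rows: TableRows streamed into an existing BigQuery table.
  static PCollection<TableRow> writeAndGetFailures(PCollection<TableRow> rows) {
    WriteResult result = rows.apply(
        BigQueryIO.writeTableRows()
            .to("project:dataset.table")  // placeholder table spec
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withCreateDisposition(
                BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    // Rows BigQuery rejected come back one by one; route them to a
    // dead-letter destination for investigation.
    return result.getFailedInserts();
  }
}
```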


Re: writing status

Posted by Reuven Lax <re...@google.com.INVALID>.
No, windowing is supported on BigQuery.

The BigQueryIO transform is divided into two parts. The first part takes
your input elements and decides which tables they should be written to.
This can be based on the windows that the elements are in (as well as the
data in the elements themselves if you want).

The second part of the transform actually writes the data to BigQuery.
Since we've already used the windowing information to decide which tables
to write the elements to, this second part does not need windowing
information any more. It is written in terms of Beam primitives (e.g.
GroupByKey) that we don't want affected by windowing; at this point we want
to write the data to the target BigQuery tables as fast as possible. So
this second half of the transform rewindows all the data in GlobalWindows
before writing to BigQuery.

As for getFailedInserts, it is only supported for the streaming path. The
reason why is that it does not make sense for the batch path. The batch
path works by writing all elements to files in GCS, and then telling
BigQuery to import all these files. If there's a failure, we don't get any
information about which row failed, all we know is that the entire import
job failed. In streaming on the other hand we insert the rows one by one,
and when there's a failure we get detailed information on which row failed
and can return it in getFailedInserts.

Reuven
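
The two paths described above correspond to the sink's write methods, which can also be pinned explicitly rather than left to the bounded/unbounded default. A sketch with enum names as of Beam 2.x, not a recommendation for either path:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

class WriteMethodChoice {
  // Batch path: stage files on GCS, then issue BigQuery load jobs.
  static BigQueryIO.Write<TableRow> fileLoads() {
    return BigQueryIO.writeTableRows()
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS);
  }

  // Streaming path: per-row inserts, with per-row failure reporting.
  static BigQueryIO.Write<TableRow> streamingInserts() {
    return BigQueryIO.writeTableRows()
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS);
  }
}
```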

On Sat, Sep 9, 2017 at 10:25 PM, Chaim Turkel <ch...@behalf.com> wrote:

> so what you are saying is that windowing is not supported on the
> bigquery? this does not make sense since i am using it for the table
> partition, and that works fine?
>
> On Sat, Sep 9, 2017 at 7:40 PM, Steve Niemitz <sn...@apache.org> wrote:
> > I wonder if it makes sense to start simple and go from there.  For
> example,
> > I enhanced BigtableIO.Write to output the number of rows written
> > in finishBundle(), simply into the global window with the current
> > timestamp.  This was more than enough to unblock me, but doesn't support
> > more complicated scenarios with windowing.
> >
> > However, as I said it was more than enough to solve the general batch use
> > case, and I imagine could be enhanced to support windowing by keeping
> track
> > of which windows were written per bundle. (can there even ever be more
> than
> > one window per bundle?)
> >
> > On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
> > kirpichov@google.com.invalid> wrote:
> >
> >> Hi,
> >> I was going to implement this, but discussed it with +Reuven Lax
> >> <re...@google.com> and it appears to be quite difficult to do
> properly, or
> >> even to define what it means at all, especially if you're using the
> >> streaming inserts write method. So for now there is no workaround except
> >> programmatically waiting for your whole pipeline to finish
> >> (pipeline.run().waitUntilFinish()).
> >>
> >> On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:
> >>
> >> > is there a way around this for now?
> >> > how can i get a snapshot version?
> >> >
> >> > chaim
> >> >
> >> > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
> >> > <ki...@google.com.invalid> wrote:
> >> > > Oh I see! Okay, this should be easy to fix. I'll take a look.
> >> > >
> >> > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com>
> wrote:
> >> > >
> >> > >> WriteResult does not support apply -> that is the problem
> >> > >>
> >> > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
> >> > >> <ki...@google.com.invalid> wrote:
> >> > >> > Hi,
> >> > >> >
> >> > >> > Sorry for the delay. So sounds like you want to do something
> after
> >> > >> writing
> >> > >> > a window of data to BigQuery is complete.
> >> > >> > I think this should be possible: expansion of BigQueryIO.write()
> >> > returns
> >> > >> a
> >> > >> > WriteResult and you can apply other transforms to it. Have you
> tried
> >> > >> that?
> >> > >> >
> >> > >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com>
> >> > wrote:
> >> > >> >
> >> > >> >> I have documents from a mongo db that i need to migrate to
> >> bigquery.
> >> > >> >> Since it is mongodb i do not know they schema ahead of time, so
> i
> >> > have
> >> > >> >> two pipelines, one to run over the documents and update the
> >> bigquery
> >> > >> >> schema, then wait a few minutes (i can take for bigquery to be
> able
> >> > to
> >> > >> >> use the new schema) then with the other pipline copy all the
> >> > >> >> documents.
> >> > >> >> To know as to where i got with the different piplines i have a
> >> status
> >> > >> >> table so that at the start i know from where to continue.
> >> > >> >> So i need the option to update the status table with the
> success of
> >> > >> >> the copy and some time value of the last copied document
> >> > >> >>
> >> > >> >>
> >> > >> >> chaim
> >> > >> >>
> >> > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> >> > >> >> <ki...@google.com.invalid> wrote:
> >> > >> >> > I'd like to know more about your both use cases, can you
> >> clarify? I
> >> > >> think
> >> > >> >> > making sinks output something that can be waited on by another
> >> > >> pipeline
> >> > >> >> > step is a reasonable request, but more details would help
> refine
> >> > this
> >> > >> >> > suggestion.
> >> > >> >> >
> >> > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
> >> > >> chamikara@apache.org>
> >> > >> >> > wrote:
> >> > >> >> >
> >> > >> >> >> Can you do this from the program that runs the Beam job,
> after
> >> > job is
> >> > >> >> >> complete (you might have to use a blocking runner or poll for
> >> the
> >> > >> >> status of
> >> > >> >> >> the job) ?
> >> > >> >> >>
> >> > >> >> >> - Cham
> >> > >> >> >>
> >> > >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <
> >> > sniemitz@apache.org>
> >> > >> >> wrote:
> >> > >> >> >>
> >> > >> >> >> > I also have a similar use case (but with BigTable) that I
> feel
> >> > >> like I
> >> > >> >> had
> >> > >> >> >> > to hack up to make work.  It'd be great to hear if there
> is a
> >> > way
> >> > >> to
> >> > >> >> do
> >> > >> >> >> > something like this already, or if there are plans in the
> >> > future.
> >> > >> >> >> >
> >> > >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <
> >> chaim@behalf.com
> >> > >
> >> > >> >> wrote:
> >> > >> >> >> >
> >> > >> >> >> > > Hi,
> >> > >> >> >> > >   I have a few piplines that are an ETL from different
> >> > systems to
> >> > >> >> >> > bigquery.
> >> > >> >> >> > > I would like to write the status of the ETL after all
> >> records
> >> > >> have
> >> > >> >> >> > > been updated to the bigquery.
> >> > >> >> >> > > The problem is that writing to bigquery is a sink and you
> >> > cannot
> >> > >> >> have
> >> > >> >> >> > > any other steps after the sink.
> >> > >> >> >> > > I tried a sideoutput, but this is called in no
> correlation
> >> to
> >> > the
> >> > >> >> >> > > writing to bigquery, so i don't know if it succeeded or
> >> > failed.
> >> > >> >> >> > >
> >> > >> >> >> > >
> >> > >> >> >> > > any ideas?
> >> > >> >> >> > > chaim
> >> > >> >> >> > >
> >> > >> >> >> >
> >> > >> >> >>
> >> > >> >>
> >> > >>
> >> >
> >>
>

Re: writing status

Posted by Chaim Turkel <ch...@behalf.com>.
So what you are saying is that windowing is not supported on the
BigQuery sink? That does not make sense, since I am using it for table
partitioning, and that works fine?
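For context, the windowed write to a partitioned table that Chaim describes can be sketched like this in Beam's Java SDK (project, dataset, and schema names are placeholders; the table function routes each window to a day partition):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.Duration;
import org.joda.time.format.DateTimeFormat;

// Assumes `rows` is a PCollection<TableRow> with meaningful event timestamps
// and `schema` is the TableSchema for the destination.
rows
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardDays(1))))
    .apply(BigQueryIO.writeTableRows()
        .to((ValueInSingleWindow<TableRow> value) -> {
          // Pick the partition from the element's window.
          String day = DateTimeFormat.forPattern("yyyyMMdd")
              .print(value.getWindow().maxTimestamp());
          return new TableDestination(
              "my-project:my_dataset.events$" + day,
              "daily partition " + day);
        })
        .withSchema(schema));
```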

On Sat, Sep 9, 2017 at 7:40 PM, Steve Niemitz <sn...@apache.org> wrote:
> I wonder if it makes sense to start simple and go from there.  For example,
> I enhanced BigtableIO.Write to output the number of rows written
> in finishBundle(), simply into the global window with the current
> timestamp.  This was more than enough to unblock me, but doesn't support
> more complicated scenarios with windowing.
>
> However, as I said it was more than enough to solve the general batch use
> case, and I imagine could be enhanced to support windowing by keeping track
> of which windows were written per bundle. (can there even ever be more than
> one window per bundle?)
>
> On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
> kirpichov@google.com.invalid> wrote:
>
>> Hi,
>> I was going to implement this, but discussed it with +Reuven Lax
>> <re...@google.com> and it appears to be quite difficult to do properly, or
>> even to define what it means at all, especially if you're using the
>> streaming inserts write method. So for now there is no workaround except
>> programmatically waiting for your whole pipeline to finish
>> (pipeline.run().waitUntilFinish()).
>>
>> On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:
>>
>> > is there a way around this for now?
>> > how can i get a snapshot version?
>> >
>> > chaim
>> >
>> > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
>> > <ki...@google.com.invalid> wrote:
>> > > Oh I see! Okay, this should be easy to fix. I'll take a look.
>> > >
>> > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>> > >
>> > >> WriteResult does not support apply -> that is the problem
>> > >>
>> > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
>> > >> <ki...@google.com.invalid> wrote:
>> > >> > Hi,
>> > >> >
>> > >> > Sorry for the delay. So sounds like you want to do something after
>> > >> writing
>> > >> > a window of data to BigQuery is complete.
>> > >> > I think this should be possible: expansion of BigQueryIO.write()
>> > returns
>> > >> a
>> > >> > WriteResult and you can apply other transforms to it. Have you tried
>> > >> that?
>> > >> >
>> > >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com>
>> > wrote:
>> > >> >
>> > >> >> I have documents from a mongo db that i need to migrate to
>> bigquery.
>> > >> >> Since it is mongodb i do not know they schema ahead of time, so i
>> > have
>> > >> >> two pipelines, one to run over the documents and update the
>> bigquery
>> > >> >> schema, then wait a few minutes (i can take for bigquery to be able
>> > to
>> > >> >> use the new schema) then with the other pipline copy all the
>> > >> >> documents.
>> > >> >> To know as to where i got with the different piplines i have a
>> status
>> > >> >> table so that at the start i know from where to continue.
>> > >> >> So i need the option to update the status table with the success of
>> > >> >> the copy and some time value of the last copied document
>> > >> >>
>> > >> >>
>> > >> >> chaim
>> > >> >>
>> > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> > >> >> <ki...@google.com.invalid> wrote:
>> > >> >> > I'd like to know more about your both use cases, can you
>> clarify? I
>> > >> think
>> > >> >> > making sinks output something that can be waited on by another
>> > >> pipeline
>> > >> >> > step is a reasonable request, but more details would help refine
>> > this
>> > >> >> > suggestion.
>> > >> >> >
>> > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
>> > >> chamikara@apache.org>
>> > >> >> > wrote:
>> > >> >> >
>> > >> >> >> Can you do this from the program that runs the Beam job, after
>> > job is
>> > >> >> >> complete (you might have to use a blocking runner or poll for
>> the
>> > >> >> status of
>> > >> >> >> the job) ?
>> > >> >> >>
>> > >> >> >> - Cham
>> > >> >> >>
>> > >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <
>> > sniemitz@apache.org>
>> > >> >> wrote:
>> > >> >> >>
>> > >> >> >> > I also have a similar use case (but with BigTable) that I feel
>> > >> like I
>> > >> >> had
>> > >> >> >> > to hack up to make work.  It'd be great to hear if there is a
>> > way
>> > >> to
>> > >> >> do
>> > >> >> >> > something like this already, or if there are plans in the
>> > future.
>> > >> >> >> >
>> > >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <
>> chaim@behalf.com
>> > >
>> > >> >> wrote:
>> > >> >> >> >
>> > >> >> >> > > Hi,
>> > >> >> >> > >   I have a few piplines that are an ETL from different
>> > systems to
>> > >> >> >> > bigquery.
>> > >> >> >> > > I would like to write the status of the ETL after all
>> records
>> > >> have
>> > >> >> >> > > been updated to the bigquery.
>> > >> >> >> > > The problem is that writing to bigquery is a sink and you
>> > cannot
>> > >> >> have
>> > >> >> >> > > any other steps after the sink.
>> > >> >> >> > > I tried a sideoutput, but this is called in no correlation
>> to
>> > the
>> > >> >> >> > > writing to bigquery, so i don't know if it succeeded or
>> > failed.
>> > >> >> >> > >
>> > >> >> >> > >
>> > >> >> >> > > any ideas?
>> > >> >> >> > > chaim
>> > >> >> >> > >
>> > >> >> >> >
>> > >> >> >>
>> > >> >>
>> > >>
>> >
>>

Re: writing status

Posted by Steve Niemitz <sn...@apache.org>.
I wonder if it makes sense to start simple and go from there.  For example,
I enhanced BigtableIO.Write to output the number of rows written
in finishBundle(), simply into the global window with the current
timestamp.  This was more than enough to unblock me, but doesn't support
more complicated scenarios with windowing.

However, as I said, it was more than enough to solve the general batch use
case, and I imagine it could be enhanced to support windowing by keeping track
of which windows were written per bundle. (Can there ever even be more than
one window per bundle?)
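The enhancement described above can be sketched as a DoFn that wraps the actual write and emits a per-bundle count from finishBundle() (the class and field names are illustrative; the write itself is elided):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

// Wraps a write and reports how many rows each bundle wrote, so a
// downstream step can observe that a bundle finished writing.
static class CountingWriteFn<T> extends DoFn<T, Long> {
  private transient long rowsInBundle;

  @StartBundle
  public void startBundle() {
    rowsInBundle = 0;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // ... perform the actual write for c.element() here ...
    rowsInBundle++;
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // Emit the bundle's count into the global window at the current
    // time, as described for the BigtableIO.Write enhancement above.
    c.output(rowsInBundle, Instant.now(), GlobalWindow.INSTANCE);
  }
}
```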

On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
kirpichov@google.com.invalid> wrote:

> Hi,
> I was going to implement this, but discussed it with +Reuven Lax
> <re...@google.com> and it appears to be quite difficult to do properly, or
> even to define what it means at all, especially if you're using the
> streaming inserts write method. So for now there is no workaround except
> programmatically waiting for your whole pipeline to finish
> (pipeline.run().waitUntilFinish()).
>
> On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:
>
> > is there a way around this for now?
> > how can i get a snapshot version?
> >
> > chaim
> >
> > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
> > <ki...@google.com.invalid> wrote:
> > > Oh I see! Okay, this should be easy to fix. I'll take a look.
> > >
> > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
> > >
> > >> WriteResult does not support apply -> that is the problem
> > >>
> > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
> > >> <ki...@google.com.invalid> wrote:
> > >> > Hi,
> > >> >
> > >> > Sorry for the delay. So sounds like you want to do something after
> > >> writing
> > >> > a window of data to BigQuery is complete.
> > >> > I think this should be possible: expansion of BigQueryIO.write()
> > returns
> > >> a
> > >> > WriteResult and you can apply other transforms to it. Have you tried
> > >> that?
> > >> >
> > >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com>
> > wrote:
> > >> >
> > >> >> I have documents from a mongo db that i need to migrate to
> bigquery.
> > >> >> Since it is mongodb i do not know they schema ahead of time, so i
> > have
> > >> >> two pipelines, one to run over the documents and update the
> bigquery
> > >> >> schema, then wait a few minutes (i can take for bigquery to be able
> > to
> > >> >> use the new schema) then with the other pipline copy all the
> > >> >> documents.
> > >> >> To know as to where i got with the different piplines i have a
> status
> > >> >> table so that at the start i know from where to continue.
> > >> >> So i need the option to update the status table with the success of
> > >> >> the copy and some time value of the last copied document
> > >> >>
> > >> >>
> > >> >> chaim
> > >> >>
> > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> > >> >> <ki...@google.com.invalid> wrote:
> > >> >> > I'd like to know more about your both use cases, can you
> clarify? I
> > >> think
> > >> >> > making sinks output something that can be waited on by another
> > >> pipeline
> > >> >> > step is a reasonable request, but more details would help refine
> > this
> > >> >> > suggestion.
> > >> >> >
> > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
> > >> chamikara@apache.org>
> > >> >> > wrote:
> > >> >> >
> > >> >> >> Can you do this from the program that runs the Beam job, after
> > job is
> > >> >> >> complete (you might have to use a blocking runner or poll for
> the
> > >> >> status of
> > >> >> >> the job) ?
> > >> >> >>
> > >> >> >> - Cham
> > >> >> >>
> > >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <
> > sniemitz@apache.org>
> > >> >> wrote:
> > >> >> >>
> > >> >> >> > I also have a similar use case (but with BigTable) that I feel
> > >> like I
> > >> >> had
> > >> >> >> > to hack up to make work.  It'd be great to hear if there is a
> > way
> > >> to
> > >> >> do
> > >> >> >> > something like this already, or if there are plans in the
> > future.
> > >> >> >> >
> > >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <
> chaim@behalf.com
> > >
> > >> >> wrote:
> > >> >> >> >
> > >> >> >> > > Hi,
> > >> >> >> > >   I have a few piplines that are an ETL from different
> > systems to
> > >> >> >> > bigquery.
> > >> >> >> > > I would like to write the status of the ETL after all
> records
> > >> have
> > >> >> >> > > been updated to the bigquery.
> > >> >> >> > > The problem is that writing to bigquery is a sink and you
> > cannot
> > >> >> have
> > >> >> >> > > any other steps after the sink.
> > >> >> >> > > I tried a sideoutput, but this is called in no correlation
> to
> > the
> > >> >> >> > > writing to bigquery, so i don't know if it succeeded or
> > failed.
> > >> >> >> > >
> > >> >> >> > >
> > >> >> >> > > any ideas?
> > >> >> >> > > chaim
> > >> >> >> > >
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
>

Re: writing status

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hi,
I was going to implement this, but discussed it with +Reuven Lax
<re...@google.com> and it appears to be quite difficult to do properly, or
even to define what it means at all, especially if you're using the
streaming inserts write method. So for now there is no workaround except
programmatically waiting for your whole pipeline to finish
(pipeline.run().waitUntilFinish()).
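A sketch of that workaround: block until the whole pipeline finishes, then do the bookkeeping outside Beam (`updateStatusTable` is a placeholder for the caller's own status-table code):

```java
import org.apache.beam.sdk.PipelineResult;

// Assumes `pipeline` is fully constructed.
PipelineResult result = pipeline.run();
PipelineResult.State state = result.waitUntilFinish();
if (state == PipelineResult.State.DONE) {
  // Every sink in the pipeline, including the BigQuery write, has
  // finished, so it is now safe to record success in the status table.
  updateStatusTable();
}
```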

On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <ch...@behalf.com> wrote:

> is there a way around this for now?
> how can i get a snapshot version?
>
> chaim
>
> On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
> > Oh I see! Okay, this should be easy to fix. I'll take a look.
> >
> > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
> >
> >> WriteResult does not support apply -> that is the problem
> >>
> >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
> >> <ki...@google.com.invalid> wrote:
> >> > Hi,
> >> >
> >> > Sorry for the delay. So sounds like you want to do something after
> >> writing
> >> > a window of data to BigQuery is complete.
> >> > I think this should be possible: expansion of BigQueryIO.write()
> returns
> >> a
> >> > WriteResult and you can apply other transforms to it. Have you tried
> >> that?
> >> >
> >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com>
> wrote:
> >> >
> >> >> I have documents from a mongo db that i need to migrate to bigquery.
> >> >> Since it is mongodb i do not know they schema ahead of time, so i
> have
> >> >> two pipelines, one to run over the documents and update the bigquery
> >> >> schema, then wait a few minutes (i can take for bigquery to be able
> to
> >> >> use the new schema) then with the other pipline copy all the
> >> >> documents.
> >> >> To know as to where i got with the different piplines i have a status
> >> >> table so that at the start i know from where to continue.
> >> >> So i need the option to update the status table with the success of
> >> >> the copy and some time value of the last copied document
> >> >>
> >> >>
> >> >> chaim
> >> >>
> >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> >> >> <ki...@google.com.invalid> wrote:
> >> >> > I'd like to know more about your both use cases, can you clarify? I
> >> think
> >> >> > making sinks output something that can be waited on by another
> >> pipeline
> >> >> > step is a reasonable request, but more details would help refine
> this
> >> >> > suggestion.
> >> >> >
> >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
> >> chamikara@apache.org>
> >> >> > wrote:
> >> >> >
> >> >> >> Can you do this from the program that runs the Beam job, after
> job is
> >> >> >> complete (you might have to use a blocking runner or poll for the
> >> >> status of
> >> >> >> the job) ?
> >> >> >>
> >> >> >> - Cham
> >> >> >>
> >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <
> sniemitz@apache.org>
> >> >> wrote:
> >> >> >>
> >> >> >> > I also have a similar use case (but with BigTable) that I feel
> >> like I
> >> >> had
> >> >> >> > to hack up to make work.  It'd be great to hear if there is a
> way
> >> to
> >> >> do
> >> >> >> > something like this already, or if there are plans in the
> future.
> >> >> >> >
> >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <chaim@behalf.com
> >
> >> >> wrote:
> >> >> >> >
> >> >> >> > > Hi,
> >> >> >> > >   I have a few piplines that are an ETL from different
> systems to
> >> >> >> > bigquery.
> >> >> >> > > I would like to write the status of the ETL after all records
> >> have
> >> >> >> > > been updated to the bigquery.
> >> >> >> > > The problem is that writing to bigquery is a sink and you
> cannot
> >> >> have
> >> >> >> > > any other steps after the sink.
> >> >> >> > > I tried a sideoutput, but this is called in no correlation to
> the
> >> >> >> > > writing to bigquery, so i don't know if it succeeded or
> failed.
> >> >> >> > >
> >> >> >> > >
> >> >> >> > > any ideas?
> >> >> >> > > chaim
> >> >> >> > >
> >> >> >> >
> >> >> >>
> >> >>
> >>
>

Re: writing status

Posted by Chaim Turkel <ch...@behalf.com>.
Is there a way around this for now?
How can I get a snapshot version?

chaim

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
<ki...@google.com.invalid> wrote:
> Oh I see! Okay, this should be easy to fix. I'll take a look.
>
> On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> WriteResult does not support apply -> that is the problem
>>
>> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
>> <ki...@google.com.invalid> wrote:
>> > Hi,
>> >
>> > Sorry for the delay. So sounds like you want to do something after
>> writing
>> > a window of data to BigQuery is complete.
>> > I think this should be possible: expansion of BigQueryIO.write() returns
>> a
>> > WriteResult and you can apply other transforms to it. Have you tried
>> that?
>> >
>> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> I have documents from a mongo db that i need to migrate to bigquery.
>> >> Since it is mongodb i do not know they schema ahead of time, so i have
>> >> two pipelines, one to run over the documents and update the bigquery
>> >> schema, then wait a few minutes (i can take for bigquery to be able to
>> >> use the new schema) then with the other pipline copy all the
>> >> documents.
>> >> To know as to where i got with the different piplines i have a status
>> >> table so that at the start i know from where to continue.
>> >> So i need the option to update the status table with the success of
>> >> the copy and some time value of the last copied document
>> >>
>> >>
>> >> chaim
>> >>
>> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> >> <ki...@google.com.invalid> wrote:
>> >> > I'd like to know more about your both use cases, can you clarify? I
>> think
>> >> > making sinks output something that can be waited on by another
>> pipeline
>> >> > step is a reasonable request, but more details would help refine this
>> >> > suggestion.
>> >> >
>> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
>> chamikara@apache.org>
>> >> > wrote:
>> >> >
>> >> >> Can you do this from the program that runs the Beam job, after job is
>> >> >> complete (you might have to use a blocking runner or poll for the
>> >> status of
>> >> >> the job) ?
>> >> >>
>> >> >> - Cham
>> >> >>
>> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sn...@apache.org>
>> >> wrote:
>> >> >>
>> >> >> > I also have a similar use case (but with BigTable) that I feel
>> like I
>> >> had
>> >> >> > to hack up to make work.  It'd be great to hear if there is a way
>> to
>> >> do
>> >> >> > something like this already, or if there are plans in the future.
>> >> >> >
>> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> > > Hi,
>> >> >> > >   I have a few piplines that are an ETL from different systems to
>> >> >> > bigquery.
>> >> >> > > I would like to write the status of the ETL after all records
>> have
>> >> >> > > been updated to the bigquery.
>> >> >> > > The problem is that writing to bigquery is a sink and you cannot
>> >> have
>> >> >> > > any other steps after the sink.
>> >> >> > > I tried a sideoutput, but this is called in no correlation to the
>> >> >> > > writing to bigquery, so i don't know if it succeeded or failed.
>> >> >> > >
>> >> >> > >
>> >> >> > > any ideas?
>> >> >> > > chaim
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>

Re: writing status

Posted by Chaim Turkel <ch...@behalf.com>.
Thanks, please keep me updated.

On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
<ki...@google.com.invalid> wrote:
> Oh I see! Okay, this should be easy to fix. I'll take a look.
>
> On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> WriteResult does not support apply -> that is the problem
>>
>> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
>> <ki...@google.com.invalid> wrote:
>> > Hi,
>> >
>> > Sorry for the delay. So sounds like you want to do something after
>> writing
>> > a window of data to BigQuery is complete.
>> > I think this should be possible: expansion of BigQueryIO.write() returns
>> a
>> > WriteResult and you can apply other transforms to it. Have you tried
>> that?
>> >
>> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> I have documents from a mongo db that i need to migrate to bigquery.
>> >> Since it is mongodb i do not know they schema ahead of time, so i have
>> >> two pipelines, one to run over the documents and update the bigquery
>> >> schema, then wait a few minutes (i can take for bigquery to be able to
>> >> use the new schema) then with the other pipline copy all the
>> >> documents.
>> >> To know as to where i got with the different piplines i have a status
>> >> table so that at the start i know from where to continue.
>> >> So i need the option to update the status table with the success of
>> >> the copy and some time value of the last copied document
>> >>
>> >>
>> >> chaim
>> >>
>> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> >> <ki...@google.com.invalid> wrote:
>> >> > I'd like to know more about your both use cases, can you clarify? I
>> think
>> >> > making sinks output something that can be waited on by another
>> pipeline
>> >> > step is a reasonable request, but more details would help refine this
>> >> > suggestion.
>> >> >
>> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
>> chamikara@apache.org>
>> >> > wrote:
>> >> >
>> >> >> Can you do this from the program that runs the Beam job, after job is
>> >> >> complete (you might have to use a blocking runner or poll for the
>> >> status of
>> >> >> the job) ?
>> >> >>
>> >> >> - Cham
>> >> >>
>> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sn...@apache.org>
>> >> wrote:
>> >> >>
>> >> >> > I also have a similar use case (but with BigTable) that I feel
>> like I
>> >> had
>> >> >> > to hack up to make work.  It'd be great to hear if there is a way
>> to
>> >> do
>> >> >> > something like this already, or if there are plans in the future.
>> >> >> >
>> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
>> >> wrote:
>> >> >> >
>> >> >> > > Hi,
>> >> >> > >   I have a few piplines that are an ETL from different systems to
>> >> >> > bigquery.
>> >> >> > > I would like to write the status of the ETL after all records
>> have
>> >> >> > > been updated to the bigquery.
>> >> >> > > The problem is that writing to bigquery is a sink and you cannot
>> >> have
>> >> >> > > any other steps after the sink.
>> >> >> > > I tried a sideoutput, but this is called in no correlation to the
>> >> >> > > writing to bigquery, so i don't know if it succeeded or failed.
>> >> >> > >
>> >> >> > >
>> >> >> > > any ideas?
>> >> >> > > chaim
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>

Re: writing status

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Oh I see! Okay, this should be easy to fix. I'll take a look.

On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <ch...@behalf.com> wrote:

> WriteResult does not support apply -> that is the problem
>
> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
> > Hi,
> >
> > Sorry for the delay. So sounds like you want to do something after
> writing
> > a window of data to BigQuery is complete.
> > I think this should be possible: expansion of BigQueryIO.write() returns
> a
> > WriteResult and you can apply other transforms to it. Have you tried
> that?
> >
> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
> >
> >> I have documents from a mongo db that i need to migrate to bigquery.
> >> Since it is mongodb i do not know they schema ahead of time, so i have
> >> two pipelines, one to run over the documents and update the bigquery
> >> schema, then wait a few minutes (i can take for bigquery to be able to
> >> use the new schema) then with the other pipline copy all the
> >> documents.
> >> To know as to where i got with the different piplines i have a status
> >> table so that at the start i know from where to continue.
> >> So i need the option to update the status table with the success of
> >> the copy and some time value of the last copied document
> >>
> >>
> >> chaim
> >>
> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> >> <ki...@google.com.invalid> wrote:
> >> > I'd like to know more about your both use cases, can you clarify? I
> think
> >> > making sinks output something that can be waited on by another
> pipeline
> >> > step is a reasonable request, but more details would help refine this
> >> > suggestion.
> >> >
> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <
> chamikara@apache.org>
> >> > wrote:
> >> >
> >> >> Can you do this from the program that runs the Beam job, after job is
> >> >> complete (you might have to use a blocking runner or poll for the
> >> status of
> >> >> the job) ?
> >> >>
> >> >> - Cham
> >> >>
> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sn...@apache.org>
> >> wrote:
> >> >>
> >> >> > I also have a similar use case (but with BigTable) that I feel
> like I
> >> had
> >> >> > to hack up to make work.  It'd be great to hear if there is a way
> to
> >> do
> >> >> > something like this already, or if there are plans in the future.
> >> >> >
> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
> >> wrote:
> >> >> >
> >> >> > > Hi,
> >> >> > >   I have a few piplines that are an ETL from different systems to
> >> >> > bigquery.
> >> >> > > I would like to write the status of the ETL after all records
> have
> >> >> > > been updated to the bigquery.
> >> >> > > The problem is that writing to bigquery is a sink and you cannot
> >> have
> >> >> > > any other steps after the sink.
> >> >> > > I tried a sideoutput, but this is called in no correlation to the
> >> >> > > writing to bigquery, so i don't know if it succeeded or failed.
> >> >> > >
> >> >> > >
> >> >> > > any ideas?
> >> >> > > chaim
> >> >> > >
> >> >> >
> >> >>
> >>
>

Re: writing status

Posted by Chaim Turkel <ch...@behalf.com>.
WriteResult does not support apply(); that is the problem.

On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
<ki...@google.com.invalid> wrote:
> Hi,
>
> Sorry for the delay. So sounds like you want to do something after writing
> a window of data to BigQuery is complete.
> I think this should be possible: expansion of BigQueryIO.write() returns a
> WriteResult and you can apply other transforms to it. Have you tried that?
>
> On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> I have documents from a mongo db that i need to migrate to bigquery.
>> Since it is mongodb i do not know they schema ahead of time, so i have
>> two pipelines, one to run over the documents and update the bigquery
>> schema, then wait a few minutes (i can take for bigquery to be able to
>> use the new schema) then with the other pipline copy all the
>> documents.
>> To know as to where i got with the different piplines i have a status
>> table so that at the start i know from where to continue.
>> So i need the option to update the status table with the success of
>> the copy and some time value of the last copied document
>>
>>
>> chaim
>>
>> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
>> <ki...@google.com.invalid> wrote:
>> > I'd like to know more about your both use cases, can you clarify? I think
>> > making sinks output something that can be waited on by another pipeline
>> > step is a reasonable request, but more details would help refine this
>> > suggestion.
>> >
>> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <ch...@apache.org>
>> > wrote:
>> >
>> >> Can you do this from the program that runs the Beam job, after job is
>> >> complete (you might have to use a blocking runner or poll for the
>> status of
>> >> the job) ?
>> >>
>> >> - Cham
>> >>
>> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <sn...@apache.org>
>> wrote:
>> >>
>> >> > I also have a similar use case (but with BigTable) that I feel like I
>> had
>> >> > to hack up to make work.  It'd be great to hear if there is a way to
>> do
>> >> > something like this already, or if there are plans in the future.
>> >> >
>> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> > > Hi,
>> >> > >   I have a few piplines that are an ETL from different systems to
>> >> > bigquery.
>> >> > > I would like to write the status of the ETL after all records have
>> >> > > been updated to the bigquery.
>> >> > > The problem is that writing to bigquery is a sink and you cannot
>> have
>> >> > > any other steps after the sink.
>> >> > > I tried a sideoutput, but this is called in no correlation to the
>> >> > > writing to bigquery, so i don't know if it succeeded or failed.
>> >> > >
>> >> > >
>> >> > > any ideas?
>> >> > > chaim
>> >> > >
>> >> >
>> >>
>>