Posted to user@beam.apache.org by Pablo Estrada <pa...@google.com> on 2019/03/21 23:20:02 UTC

PipelineOptions at execution time from DirectRunner

Hi all,
The DirectRunner does not seem to support RuntimeValueProvider. Is there a
suggested way for DirectRunner pipelines to access arguments passed in as
pipeline options (but not necessarily passed explicitly by users) at
pipeline execution time?

Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
possibility, but it seems like that's not ideal. Any other suggestions?

Should we simply support RuntimeValueProvider in direct runner?
Best
-P.

Re: PipelineOptions at execution time from DirectRunner

Posted by Ahmet Altay <al...@google.com>.
On Thu, Mar 21, 2019 at 5:29 PM Pablo Estrada <pa...@google.com> wrote:

> Thanks Ahmet! These are illustrative explanations.
>
> I still wonder about one question:
>
>>
>>> Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is
>>> a possibility, but it seems like that's not ideal. Any other suggestions?
>>>
> Is this an appropriate way of obtaining an option that is not explicitly
> passed by the user? It prints a warning.
>
> "What's the context?" - I'm working on a transform that writes to
> BigQuery, and table destinations can come in the form "dataset.table" or
> "project:dataset.table". Because these are parsed at runtime (destinations
> are dynamic), the PTransform checks whether a project was provided in
> PipelineOptions via value providers.
>

Got it. We wanted to discourage this use and deprecated it, but we have not
really offered a clean alternative either. I think we will need to figure
this out, but it does not need to happen within the context of your work.
IMO, what you are doing is fine.

If we want to think about a solution, one option would be to pass everything
the transform needs as explicit arguments. In this case, could you add an
explicit project argument to the transform? (If we do that, how do we ensure
that the explicit project argument matches the project given as a pipeline
option?)
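
One possible answer to the consistency question, sketched in plain Python
rather than Beam code. `resolve_project`, `explicit_project` and
`options_project` are hypothetical names, and failing on a mismatch is an
assumption about the desired behavior, not something settled in this thread:

```python
from typing import Optional


def resolve_project(explicit_project: Optional[str],
                    options_project: Optional[str]) -> str:
    """Pick the project to use, refusing to guess when the two sources disagree.

    explicit_project: value passed directly to the transform (hypothetical).
    options_project: value found in the pipeline options (hypothetical).
    """
    if explicit_project and options_project and explicit_project != options_project:
        raise ValueError(
            f"Transform project {explicit_project!r} does not match "
            f"pipeline option project {options_project!r}")
    project = explicit_project or options_project
    if not project:
        raise ValueError("No project given to the transform or in pipeline options")
    return project
```

The transform would call this once at construction time, so a conflict is
reported before the pipeline starts rather than while rows are being written.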

>
> Thanks!
> -P.
>
>
>
>>> Should we simply support RuntimeValueProvider in direct runner?
>>>
>>
>> This is a bit tricky for Python, because it is possible to run multiple
>> pipelines with DirectRunner in the same process (e.g. call run and do not
>> block on the results). RuntimeValueProvider works by setting a global
>> variable, and since multiple DirectRunner pipelines could share the same
>> process, it gets tricky to support this.
>>
>>
>>> Best
>>> -P.
>>>
>>

Re: PipelineOptions at execution time from DirectRunner

Posted by Pablo Estrada <pa...@google.com>.
Thanks Ahmet! These are illustrative explanations.

I still wonder about one question:

>
>> Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
>> possibility, but it seems like that's not ideal. Any other suggestions?
>>
Is this an appropriate way of obtaining an option that is not explicitly
passed by the user? It prints a warning.

"What's the context?" - I'm working on a transform that writes to
BigQuery, and table destinations can come in the form "dataset.table" or
"project:dataset.table". Because these are parsed at runtime (destinations
are dynamic), the PTransform checks whether a project was provided in
PipelineOptions via value providers.
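
For illustration, the fallback could be sketched roughly like this. It is
plain Python, not the actual BigQuery sink code: `TableRef`,
`parse_table_destination` and `default_project` are hypothetical names, with
`default_project` standing in for whatever project the pipeline options
supply.

```python
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    project: Optional[str]
    dataset: str
    table: str


def parse_table_destination(spec: str,
                            default_project: Optional[str] = None) -> TableRef:
    """Parse 'project:dataset.table' or 'dataset.table' (hypothetical helper)."""
    # Split on the last ':' so everything before it is treated as the project.
    project, _, rest = spec.rpartition(":")
    # No project in the spec: fall back to the one from the pipeline options.
    project = project or default_project
    dataset, dot, table = rest.partition(".")
    if not dot or not dataset or not table:
        raise ValueError(f"Expected [project:]dataset.table, got {spec!r}")
    return TableRef(project, dataset, table)
```

Because destinations are dynamic, a helper like this would run once per
destination at execution time, which is why the default project has to be
reachable then and not only at construction time.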

Thanks!
-P.



>> Should we simply support RuntimeValueProvider in direct runner?
>>
>
> This is a bit tricky for Python, because it is possible to run multiple
> pipelines with DirectRunner in the same process (e.g. call run and do not
> block on the results). RuntimeValueProvider works by setting a global
> variable, and since multiple DirectRunner pipelines could share the same
> process, it gets tricky to support this.
>
>
>> Best
>> -P.
>>
>

Re: PipelineOptions at execution time from DirectRunner

Posted by Ahmet Altay <al...@google.com>.
On Thu, Mar 21, 2019 at 4:20 PM Pablo Estrada <pa...@google.com> wrote:

> Hi all,
> The DirectRunner does not seem to support RuntimeValueProvider. Is there a
> suggested way for DirectRunner pipelines to access arguments passed in as
> pipeline options (but not necessarily passed explicitly by users) at
> pipeline execution time?
>

RuntimeValueProviders are useful when some pipeline options are not set at
the time of pipeline construction but are available at execution time. In the
case of DirectRunner, what is available at execution time is the same as what
is available at construction time. Am I missing something? Why do we need
RuntimeValueProvider support in DirectRunner?


>
> Getting it as pcoll.pipeline.options in the expand(self, pcoll) call is a
> possibility, but it seems like that's not ideal. Any other suggestions?
>

> Should we simply support RuntimeValueProvider in direct runner?
>

This is a bit tricky for Python, because it is possible to run multiple
pipelines with DirectRunner in the same process (e.g. call run and do not
block on the results). RuntimeValueProvider works by setting a global
variable, and since multiple DirectRunner pipelines could share the same
process, it gets tricky to support this.
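
A toy sketch of the problem, in plain Python and not Beam's actual
implementation: `GlobalOptionProvider` is a hypothetical stand-in for a value
provider that reads from one process-wide slot. When two pipelines in the same
process each install their options, the second overwrites the first.

```python
class GlobalOptionProvider:
    """Hypothetical stand-in: resolves an option from one process-wide dict."""

    _runtime_options = None  # shared by every pipeline in this process

    def __init__(self, option_name):
        self.option_name = option_name

    @classmethod
    def set_runtime_options(cls, options):
        # Replaces whatever an earlier pipeline installed.
        cls._runtime_options = options

    def get(self):
        if GlobalOptionProvider._runtime_options is None:
            raise RuntimeError("runtime options not set yet")
        return GlobalOptionProvider._runtime_options[self.option_name]


# Pipeline A starts and installs its options.
GlobalOptionProvider.set_runtime_options({"project": "project-a"})
provider_a = GlobalOptionProvider("project")

# Pipeline B starts in the same process before A has finished.
GlobalOptionProvider.set_runtime_options({"project": "project-b"})
provider_b = GlobalOptionProvider("project")

# Pipeline A's provider now resolves to pipeline B's value.
value_seen_by_a = provider_a.get()
value_seen_by_b = provider_b.get()
```

On a runner where each pipeline's workers get their own process, the single
slot is not shared, so the collision does not arise there.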


> Best
> -P.
>
