Posted to dev@beam.apache.org by Pablo Estrada <pa...@google.com> on 2019/07/22 23:08:34 UTC

On Auto-creating GCS buckets on behalf of users

Hello all,
I recently worked on a transform to load data into BigQuery by writing
files to GCS, and issuing Load File jobs to BQ. I did this for the Python
SDK[1].

This transform requires the user to provide a GCS bucket to write the files to:

   - If the user provides a bucket to the transform, the SDK will use that
   bucket.
   - If the user does not provide a bucket:
      - When running in Dataflow, the SDK will borrow the temp_location of
      the pipeline.
      - When running in other runners, the pipeline will fail.
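The selection logic above can be sketched roughly as follows. This is a minimal illustration only: the function name, parameters, and error text are hypothetical, not the Python SDK's actual API.

```python
def resolve_gcs_bucket(user_bucket, runner, pipeline_temp_location):
    """Hypothetical sketch of the bucket-selection fallback described above."""
    # 1. An explicitly provided bucket always wins.
    if user_bucket:
        return user_bucket
    # 2. On Dataflow, borrow the pipeline's temp_location.
    if runner == "DataflowRunner" and pipeline_temp_location:
        return pipeline_temp_location
    # 3. Otherwise there is nothing to fall back to: fail the pipeline.
    raise ValueError(
        "A GCS bucket is required: pass one to the transform, or set "
        "temp_location when running on Dataflow.")
```

For example, `resolve_gcs_bucket(None, "DataflowRunner", "gs://my-bucket/tmp")` falls through to the pipeline's temp_location.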

The Java SDK has had functionality for File Loads into BQ for a long time.
Notably, when users do not provide a bucket, it attempts to create a
default bucket[2], which is then used as the temp_location (and, in turn,
by the BQ File Loads transform).
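The key property of such a default bucket is that its name is deterministic per project and region, so repeated runs reuse one bucket rather than creating new ones. A rough Python rendering of the idea (the exact naming pattern used by the Java code in [2] may differ; this is illustrative):

```python
def default_bucket_name(project_number, region):
    # One deterministic bucket per (project, region) pair: looking it up
    # by this name tells us whether it already exists before creating it.
    return "dataflow-staging-%s-%s" % (region.lower(), project_number)
```

Because the name is stable, the SDK can first check for the bucket's existence and only create it when missing.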

I do not really like creating GCS buckets on behalf of users. In Java, the
outcome is that users will not have to pass a --tempLocation parameter when
submitting jobs to Dataflow - which is a nice convenience, but I'm not sure
that this is in line with users' expectations.

Currently, the options are:

   1. Adding support for bucket autocreation in the Python SDK.
   2. Deprecating bucket autocreation in the Java SDK, and printing a
   warning.

I am personally inclined toward #1. But what do others think?

Best
-P.

[1] https://github.com/apache/beam/pull/7892
[2]
https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343

Re: On Auto-creating GCS buckets on behalf of users

Posted by Robert Bradshaw <ro...@google.com>.
IMHO, we're erring a bit too far toward making it hard to get started. I would
lean towards automatically creating (and using) a bucket, provided it had a
name that was unlikely to conflict with others and very obvious when one
saw it. (Logging is important too, but also very often ignored and not
preserved, especially in the case when things work.)

If we don't create it, I would emit a very clear error message showing
exactly how to create it, with a command the user can run before
re-attempting their pipeline.
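Such an error could embed a ready-to-run command. The `gsutil mb` invocation below is the standard bucket-creation command; the function name and message wording are invented for illustration:

```python
def missing_bucket_error(project, region, suggested_bucket):
    # Sketch: fail with an actionable, copy-pasteable message rather than
    # silently creating a billed resource on the user's behalf.
    return (
        "No temp GCS bucket available. Create one with:\n"
        "  gsutil mb -p %s -l %s gs://%s\n"
        "then re-run with --temp_location=gs://%s/tmp"
        % (project, region, suggested_bucket, suggested_bucket))
```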


On Tue, Jun 23, 2020 at 8:57 AM David Cavazos <dc...@google.com> wrote:

> I like the idea of simplifying the user experience by automating part of
> the initial setup. On the other hand, I see why silently creating billed
> resources like a GCS bucket could be an issue. I don't think creating an
> empty bucket is an issue since it doesn't incur any charges yet, but at
> least logging that it was created on the user's behalf would
> be useful. There could be a logging message saying that it either found the
> bucket with that name and it's using it, or that it didn't find it and it
> created it.
>
> If it were to be creating a resource that could incur potentially unwanted
> charges (like a Bigtable database), then I would make a prompt before
> creating it to make the users confirm they want that created. But for a GCS
> bucket I don't think that's necessary; an explicit message saying it was
> created should be enough. That way it's not
> a surprise when they see that bucket in their project, or at least they
> know where it came from.
>
> If the user wants more control over what their temp bucket needs, like
> encryption or anything else, they can still pass an explicit
> `--temp_location` parameter and they can use whatever bucket they provide.
>
> I also like that it would be consistent with how the Java SDK works.
>
> On Mon, Jun 22, 2020 at 3:35 PM Ahmet Altay <al...@google.com> wrote:
>
>> I do not have a strong opinion about this either way. I think this is
>> fundamentally a UX tradeoff between making it easier to get started and
>> potentially creating unwanted/misconfigured items. I do not have data about
>> what would be more preferable for most users. I believe either option would
>> be fine as long as we are clear with our messaging, logs, errors.
>>
>> On Mon, Jun 22, 2020 at 1:48 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> I think creating the bucket makes sense since it is an improvement in
>>> the user experience and simplifies first-time users' setup needs. We should
>>> be clear to tell users that we are doing this on their behalf.
>>>
>>> On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada <pa...@google.com>
>>> wrote:
>>>
>>>> Hi everyone,
>>>> I've gotten around to making this change, and Udi has been gracious to
>>>> review it[1].
>>>>
>>>> I figured we have not fully answered the larger question of whether we
>>>> would truly like to make this change. Here are some thoughts giving me
>>>> pause:
>>>>
>>>> 1. Appropriate defaults - We are not sure we can select appropriate
>>>> defaults on behalf of users. (We are erroring out in case of KMS keys, but
>>>> how about other properties?)
>>>> 2. Users have been using Beam's Python SDK the way it is for a long
>>>> time now: Supplying temp_location when running on Dataflow, without a
>>>> problem.
>>>> 3. This has billing implications that users may not be fully aware of
>>>>
>>>> The behavior in [1] matches the behavior of the Java SDK (create a
>>>> bucket when none is supplied AND running on Dataflow); but it still doesn't
>>>> solve the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this
>>>> can be done in a follow up change using the Default Bucket functionality).
>>>>
>>>> My bias in this case is: If it isn't broken, why fix it? I do not know
>>>> of anyone complaining about the required temp_location flag on Dataflow.
>>>>
>>>> I think we can create a default bucket when dealing with BQ outside of
>>>> Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
>>>> What do others think?
>>>>
>>>> Best
>>>> -P.
>>>>
>>>> [1] https://github.com/apache/beam/pull/11982
>>>>
>>>> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> I agree with the benefits of auto-creating buckets from an ease of use
>>>>> perspective. My counter argument is that the auto created buckets may not
>>>>> have the right settings for the users. A bucket has multiple settings, some
>>>>> required (name, storage class) and some optional (ACL policy,
>>>>> encryption, retention policy, labels). As the number of options increases,
>>>>> our chances of having a good enough default go down. For example, if a
>>>>> user wants to enable CMEK encryption, they will enable it for
>>>>> their sources and sinks, and will instruct the Dataflow runner to encrypt its
>>>>> in-flight data. Creating a default (non-encrypted) temp bucket for this
>>>>> user would go against the user's intentions. We would not be able to create a
>>>>> bucket either, because we would not know what encryption keys to use for
>>>>> such a bucket. Our options would be to either not create a bucket at all,
>>>>> or fail if a temporary bucket was not specified and a CMEK mode is enabled.
>>>>>
>>>>> There is a similar issue with the region flag. If unspecified it
>>>>> defaults to us-central1. This is convenient for new users, but not making
>>>>> that flag required will expose a larger proportion of Dataflow users to
>>>>> events in that specific region.
>>>>>
>>>>> Robert's suggestion of having a flag for opt-in to a default set of
>>>>> GCP convenience flags sounds reasonable. At least users will explicitly
>>>>> acknowledge that certain things are auto managed for them.
>>>>>
>>>>> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:
>>>>>
>>>>>> Another idea would be to put default bucket preferences in a .beamrc
>>>>>> file so you don't have to remember to pass it every time (this could also
>>>>>> contain other default flag values).
>>>>>>
>>>>>
>>>>> IMO, the first question is whether auto-creation based on some
>>>>> unconfigurable defaults would happen or not. Once we agree on that, having
>>>>> an rc file vs flags vs supporting both would be a UX question.
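The .beamrc idea could work like git config: defaults read from the rc file, overridden by anything passed explicitly. A minimal sketch, assuming a simple key=value format (Beam has no such file today, and both function names below are invented):

```python
def load_rc_defaults(rc_text):
    # Parse simple "key=value" lines, ignoring blanks and '#' comments.
    defaults = {}
    for line in rc_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            defaults[key.strip()] = value.strip()
    return defaults

def merge_flags(rc_defaults, cli_flags):
    # Explicit CLI flags always win over rc-file defaults.
    merged = dict(rc_defaults)
    merged.update(cli_flags)
    return merged
```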
>>>>>
>>>>>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>>>>>>> <ch...@google.com> wrote:
>>>>>>> >
>>>>>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> I agree with David that at least clearer log statements should be
>>>>>>> added.
>>>>>>> >>
>>>>>>> >> Udi, that's an interesting idea, but I imagine the sheer number
>>>>>>> of existing flags (including many SDK-specific flags) would make it
>>>>>>> difficult to implement. In addition, uniform argument names wouldn't
>>>>>>> necessarily ensure uniform implementation.
>>>>>>> >>
>>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>>>> kcweaver@google.com
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> Java SDK creates one regional bucket per project and region
>>>>>>> combination.
>>>>>>> >>> So it's not a lot of buckets - no need to auto-clean.
>>>>>>> >
>>>>>>> >
>>>>>>> > Agree that cleanup is not a big issue if we are only creating a
>>>>>>> single bucket per project and region. I assume we are creating temporary
>>>>>>> folders for each pipeline within the same region and project so that they
>>>>>>> don't conflict (which we clean up).
>>>>>>> > As others mentioned, we should clearly document this (including the
>>>>>>> naming of the bucket) and produce a log message during pipeline creation.
>>>>>>> >
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> I agree with Robert that having fewer flags is better.
>>>>>>> >>> Perhaps what we need is a unifying interface for SDKs that
>>>>>>> simplifies launching?
>>>>>>> >>>
>>>>>>> >>> So instead of:
>>>>>>> >>> mvn compile exec:java -Dexec.mainClass=<class>
>>>>>>> -Dexec.args="--runner=DataflowRunner --project=<project>
>>>>>>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>>>>>> >>> or
>>>>>>> >>> python -m <module> --runner DataflowRunner --project <project>
>>>>>>> --temp_location gs://<bucket>/tmp/ <user flags>
>>>>>>> >
>>>>>>> > Interesting, probably this should be extended to a generalized CLI
>>>>>>> for Beam that can be easily installed to execute Beam pipelines ?
>>>>>>>
>>>>>>> This is starting to get somewhat off-topic from the original
>>>>>>> question,
>>>>>>> but I'm not sure the benefits of providing a wrapper to the end user
>>>>>>> would outweigh the costs of having to learn the wrapper. For Python
>>>>>>> developers, python -m module, or even python -m path/to/script.py is
>>>>>>> pretty standard. Java is a bit harder, because one needs to
>>>>>>> coordinate
>>>>>>> a build as well, but I don't know how a "./beam java ..." script
>>>>>>> would
>>>>>>> gloss over whether one is using maven, gradle, ant, or just has a
>>>>>>> pile
>>>>>>> of pre-compiled jars (and would probably have to know a bit about the
>>>>>>> project layout as well to invoke the right commands).
>>>>>>>
>>>>>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by David Cavazos <dc...@google.com>.
I like the idea of simplifying the user experience by automating part of
the initial setup. On the other hand, I see why silently creating billed
resources like a GCS bucket could be an issue. I don't think creating an
empty bucket is an issue since it doesn't incur any charges yet, but at
least logging that it was created by the script in the user's behalf would
be useful. There could be a logging message saying that it either found the
bucket with that name and it's using it, or that it didn't find it and it
created it.

If it were to be creating a resource that could incur potentially unwanted
charges (like a Bigtable database), then I would make a prompt before
creating it to make the users confirm they want that created. But for a GCS
bucket I don't think that's necessary, I think as long as there's an
explicit message saying it was created should be enough. That way it's not
a surprise when they see that bucket in their project, or at least they
know where it came from.

If the user wants more control over what their temp bucket needs, like
encryption or anything else, they can still pass an explicit
`--temp_location` parameter and they can use whatever bucket they provide.

I also like that it would be consistent with how the Java SDK works.

On Mon, Jun 22, 2020 at 3:35 PM Ahmet Altay <al...@google.com> wrote:

> I do not have a strong opinion about this either way. I think this is
> fundamentally a UX tradeoff between making it easier to get started and
> potentially creating unwanted/misconfigured items. I do not have data about
> what would be more preferable for most users. I believe either option would
> be fine as long as we are clear with our messaging, logs, errors.
>
> On Mon, Jun 22, 2020 at 1:48 PM Luke Cwik <lc...@google.com> wrote:
>
>> I think creating the bucket makes sense since it is an improvement in the
>> users experience and simplifies first time users setup needs. We should be
>> clear to tell users that we are doing this on their behalf.
>>
>> On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada <pa...@google.com> wrote:
>>
>>> Hi everyone,
>>> I've gotten around to making this change, and Udi has been gracious to
>>> review it[1].
>>>
>>> I figured we have not fully answered the larger question of whether we
>>> would truly like to make this change. Here are some thoughts giving me
>>> pause:
>>>
>>> 1. Appropriate defaults - We are not sure we can select appropriate
>>> defaults on behalf of users. (We are erroring out in case of KMS keys, but
>>> how about other properties?)
>>> 2. Users have been using Beam's Python SDK the way it is for a long time
>>> now: Supplying temp_location when running on Dataflow, without a problem.
>>> 3. This has billing implications that users may not be fully aware of
>>>
>>> The behavior in [1] matches the behavior of the Java SDK (create a
>>> bucket when none is supplied AND running on Dataflow); but it still doesn't
>>> solve the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this
>>> can be done in a follow up change using the Default Bucket functionality).
>>>
>>> My bias in this case is: If it isn't broken, why fix it? I do not know
>>> of anyone complaining about the required temp_location flag on Dataflow.
>>>
>>> I think we can create a default bucket when dealing with BQ outside of
>>> Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
>>> What do others think?
>>>
>>> Best
>>> -P.
>>>
>>> [1] https://github.com/apache/beam/pull/11982
>>>
>>> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> I agree with the benefits of auto-creating buckets from an ease of use
>>>> perspective. My counter argument is that the auto created buckets may not
>>>> have the right settings for the users. A bucket has multiple settings, some
>>>> required as (name, storage class) and some optional (acl policy,
>>>> encryption, retention policy, labels). As the number of options increase
>>>> our chances of having a good enough default goes down. For example, if a
>>>> user wants to enable CMEK mode for encryption, they will enable it for
>>>> their sources, sinks, and will instruct Dataflow runner encrypt its
>>>> in-flight data. Creating a default (non-encrpyted) temp bucket for this
>>>> user would be against user's intentions. We would not be able to create a
>>>> bucket either, because we would not know what encryption keys to use for
>>>> such a bucket. Our options would be to either not create a bucket at all,
>>>> or fail if a temporary bucket was not specified and a CMEK mode is enabled.
>>>>
>>>> There is a similar issue with the region flag. If unspecified it
>>>> defaults to us-central1. This is convenient for new users, but not making
>>>> that flag required will expose a larger proportion of Dataflow users to
>>>> events in that specific region.
>>>>
>>>> Robert's suggestion of having a flag for opt-in to a default set of GCP
>>>> convenience flags sounds reasonable. At least users will explicitly
>>>> acknowledge that certain things are auto managed for them.
>>>>
>>>> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:
>>>>
>>>>> Another idea would be to put default bucket preferences in a .beamrc
>>>>> file so you don't have to remember to pass it every time (this could also
>>>>> contain other default flag values).
>>>>>
>>>>
>>>> IMO, the first question is whether auto-creation based on some
>>>> unconfigurable defaults would happen or not. Once we agree on that, having
>>>> an rc file vs flags vs supporting both would be a UX question.
>>>>
>>>>
>>>>>
>>>>>
>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>>>>>> <ch...@google.com> wrote:
>>>>>> >
>>>>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> I agree with David that at least clearer log statements should be
>>>>>> added.
>>>>>> >>
>>>>>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>>>>>> existing flags (including many SDK-specific flags) would make it difficult
>>>>>> to implement. In addition, uniform argument names wouldn't necessarily
>>>>>> ensure uniform implementation.
>>>>>> >>
>>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>>> kcweaver@google.com
>>>>>> >>
>>>>>> >>
>>>>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> Java SDK creates one regional bucket per project and region
>>>>>> combination.
>>>>>> >>> So it's not a lot of buckets - no need to auto-clean.
>>>>>> >
>>>>>> >
>>>>>> > Agree that cleanup is not a bit issue if we are only creating a
>>>>>> single bucket per project and region. I assume we are creating temporary
>>>>>> folders for each pipeline with the same region and project so that they
>>>>>> don't conclifc (which we clean up).
>>>>>> > As others mentioned we should clearly document this (including the
>>>>>> naming of the bucket) and produce a log during pipeline creating.
>>>>>> >
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> I agree with Robert that having less flags is better.
>>>>>> >>> Perhaps what we need a unifying interface for SDKs that
>>>>>> simplifies launching?
>>>>>> >>>
>>>>>> >>> So instead of:
>>>>>> >>> mvn compile exec:java -Dexec.mainClass=<class>
>>>>>> -Dexec.args="--runner=DataflowRunner --project=<project>
>>>>>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>>>>> >>> or
>>>>>> >>> python -m <module> --runner DataflowRunner --project <project>
>>>>>> --temp_location gs://<bucket>/tmp/ <user flags>
>>>>>> >
>>>>>> > Interesting, probably this should be extended to a generalized CLI
>>>>>> for Beam that can be easily installed to execute Beam pipelines ?
>>>>>>
>>>>>> This is starting to get somewhat off-topic from the original question,
>>>>>> but I'm not sure the benefits of providing a wrapper to the end user
>>>>>> would outweigh the costs of having to learn the wrapper. For Python
>>>>>> developers, python -m module, or even python -m path/to/script.py is
>>>>>> pretty standard. Java is a bit harder, because one needs to coordinate
>>>>>> a build as well, but I don't know how a "./beam java ..." script would
>>>>>> gloss over whether one is using maven, gradle, ant, or just has a pile
>>>>>> of pre-compiled jara (and would probably have to know a bit about the
>>>>>> project layout as well to invoke the right commands).
>>>>>>
>>>>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Ahmet Altay <al...@google.com>.
I do not have a strong opinion about this either way. I think this is
fundamentally a UX tradeoff between making it easier to get started and
potentially creating unwanted/misconfigured items. I do not have data about
what would be more preferable for most users. I believe either option would
be fine as long as we are clear with our messaging, logs, errors.

On Mon, Jun 22, 2020 at 1:48 PM Luke Cwik <lc...@google.com> wrote:

> I think creating the bucket makes sense since it is an improvement in the
> users experience and simplifies first time users setup needs. We should be
> clear to tell users that we are doing this on their behalf.
>
> On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada <pa...@google.com> wrote:
>
>> Hi everyone,
>> I've gotten around to making this change, and Udi has been gracious to
>> review it[1].
>>
>> I figured we have not fully answered the larger question of whether we
>> would truly like to make this change. Here are some thoughts giving me
>> pause:
>>
>> 1. Appropriate defaults - We are not sure we can select appropriate
>> defaults on behalf of users. (We are erroring out in case of KMS keys, but
>> how about other properties?)
>> 2. Users have been using Beam's Python SDK the way it is for a long time
>> now: Supplying temp_location when running on Dataflow, without a problem.
>> 3. This has billing implications that users may not be fully aware of
>>
>> The behavior in [1] matches the behavior of the Java SDK (create a bucket
>> when none is supplied AND running on Dataflow); but it still doesn't solve
>> the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this can be
>> done in a follow up change using the Default Bucket functionality).
>>
>> My bias in this case is: If it isn't broken, why fix it? I do not know of
>> anyone complaining about the required temp_location flag on Dataflow.
>>
>> I think we can create a default bucket when dealing with BQ outside of
>> Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
>> What do others think?
>>
>> Best
>> -P.
>>
>> [1] https://github.com/apache/beam/pull/11982
>>
>> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> I agree with the benefits of auto-creating buckets from an ease of use
>>> perspective. My counter argument is that the auto created buckets may not
>>> have the right settings for the users. A bucket has multiple settings, some
>>> required as (name, storage class) and some optional (acl policy,
>>> encryption, retention policy, labels). As the number of options increase
>>> our chances of having a good enough default goes down. For example, if a
>>> user wants to enable CMEK mode for encryption, they will enable it for
>>> their sources, sinks, and will instruct Dataflow runner encrypt its
>>> in-flight data. Creating a default (non-encrpyted) temp bucket for this
>>> user would be against user's intentions. We would not be able to create a
>>> bucket either, because we would not know what encryption keys to use for
>>> such a bucket. Our options would be to either not create a bucket at all,
>>> or fail if a temporary bucket was not specified and a CMEK mode is enabled.
>>>
>>> There is a similar issue with the region flag. If unspecified it
>>> defaults to us-central1. This is convenient for new users, but not making
>>> that flag required will expose a larger proportion of Dataflow users to
>>> events in that specific region.
>>>
>>> Robert's suggestion of having a flag for opt-in to a default set of GCP
>>> convenience flags sounds reasonable. At least users will explicitly
>>> acknowledge that certain things are auto managed for them.
>>>
>>> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:
>>>
>>>> Another idea would be to put default bucket preferences in a .beamrc
>>>> file so you don't have to remember to pass it every time (this could also
>>>> contain other default flag values).
>>>>
>>>
>>> IMO, the first question is whether auto-creation based on some
>>> unconfigurable defaults would happen or not. Once we agree on that, having
>>> an rc file vs flags vs supporting both would be a UX question.
>>>
>>>
>>>>
>>>>
>>>
>>>>
>>>>
>>>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>>>>> <ch...@google.com> wrote:
>>>>> >
>>>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I agree with David that at least clearer log statements should be
>>>>> added.
>>>>> >>
>>>>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>>>>> existing flags (including many SDK-specific flags) would make it difficult
>>>>> to implement. In addition, uniform argument names wouldn't necessarily
>>>>> ensure uniform implementation.
>>>>> >>
>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>>> kcweaver@google.com
>>>>> >>
>>>>> >>
>>>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> Java SDK creates one regional bucket per project and region
>>>>> combination.
>>>>> >>> So it's not a lot of buckets - no need to auto-clean.
>>>>> >
>>>>> >
>>>>> > Agree that cleanup is not a bit issue if we are only creating a
>>>>> single bucket per project and region. I assume we are creating temporary
>>>>> folders for each pipeline with the same region and project so that they
>>>>> don't conclifc (which we clean up).
>>>>> > As others mentioned we should clearly document this (including the
>>>>> naming of the bucket) and produce a log during pipeline creating.
>>>>> >
>>>>> >>>
>>>>> >>>
>>>>> >>> I agree with Robert that having less flags is better.
>>>>> >>> Perhaps what we need a unifying interface for SDKs that simplifies
>>>>> launching?
>>>>> >>>
>>>>> >>> So instead of:
>>>>> >>> mvn compile exec:java -Dexec.mainClass=<class>
>>>>> -Dexec.args="--runner=DataflowRunner --project=<project>
>>>>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>>>> >>> or
>>>>> >>> python -m <module> --runner DataflowRunner --project <project>
>>>>> --temp_location gs://<bucket>/tmp/ <user flags>
>>>>> >
>>>>> > Interesting, probably this should be extended to a generalized CLI
>>>>> for Beam that can be easily installed to execute Beam pipelines ?
>>>>>
>>>>> This is starting to get somewhat off-topic from the original question,
>>>>> but I'm not sure the benefits of providing a wrapper to the end user
>>>>> would outweigh the costs of having to learn the wrapper. For Python
>>>>> developers, python -m module, or even python -m path/to/script.py is
>>>>> pretty standard. Java is a bit harder, because one needs to coordinate
>>>>> a build as well, but I don't know how a "./beam java ..." script would
>>>>> gloss over whether one is using maven, gradle, ant, or just has a pile
>>>>> of pre-compiled jara (and would probably have to know a bit about the
>>>>> project layout as well to invoke the right commands).
>>>>>
>>>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Luke Cwik <lc...@google.com>.
I think creating the bucket makes sense since it is an improvement in the
users experience and simplifies first time users setup needs. We should be
clear to tell users that we are doing this on their behalf.

On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada <pa...@google.com> wrote:

> Hi everyone,
> I've gotten around to making this change, and Udi has been gracious to
> review it[1].
>
> I figured we have not fully answered the larger question of whether we
> would truly like to make this change. Here are some thoughts giving me
> pause:
>
> 1. Appropriate defaults - We are not sure we can select appropriate
> defaults on behalf of users. (We are erroring out in case of KMS keys, but
> how about other properties?)
> 2. Users have been using Beam's Python SDK the way it is for a long time
> now: Supplying temp_location when running on Dataflow, without a problem.
> 3. This has billing implications that users may not be fully aware of
>
> The behavior in [1] matches the behavior of the Java SDK (create a bucket
> when none is supplied AND running on Dataflow); but it still doesn't solve
> the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this can be
> done in a follow up change using the Default Bucket functionality).
>
> My bias in this case is: If it isn't broken, why fix it? I do not know of
> anyone complaining about the required temp_location flag on Dataflow.
>
> I think we can create a default bucket when dealing with BQ outside of
> Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
> What do others think?
>
> Best
> -P.
>
> [1] https://github.com/apache/beam/pull/11982
>
> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <al...@google.com> wrote:
>
>> I agree with the benefits of auto-creating buckets from an ease of use
>> perspective. My counter argument is that the auto created buckets may not
>> have the right settings for the users. A bucket has multiple settings, some
>> required as (name, storage class) and some optional (acl policy,
>> encryption, retention policy, labels). As the number of options increase
>> our chances of having a good enough default goes down. For example, if a
>> user wants to enable CMEK mode for encryption, they will enable it for
>> their sources, sinks, and will instruct Dataflow runner encrypt its
>> in-flight data. Creating a default (non-encrpyted) temp bucket for this
>> user would be against user's intentions. We would not be able to create a
>> bucket either, because we would not know what encryption keys to use for
>> such a bucket. Our options would be to either not create a bucket at all,
>> or fail if a temporary bucket was not specified and a CMEK mode is enabled.
>>
>> There is a similar issue with the region flag. If unspecified it defaults
>> to us-central1. This is convenient for new users, but not making that flag
>> required will expose a larger proportion of Dataflow users to events in
>> that specific region.
>>
>> Robert's suggestion of having a flag for opt-in to a default set of GCP
>> convenience flags sounds reasonable. At least users will explicitly
>> acknowledge that certain things are auto managed for them.
>>
>> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:
>>
>>> Another idea would be to put default bucket preferences in a .beamrc
>>> file so you don't have to remember to pass it every time (this could also
>>> contain other default flag values).
>>>
>>
>> IMO, the first question is whether auto-creation based on some
>> unconfigurable defaults would happen or not. Once we agree on that, having
>> an rc file vs flags vs supporting both would be a UX question.
>>
>>
>>>
>>>
>>
>>>
>>>
>>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>>>> <ch...@google.com> wrote:
>>>> >
>>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com>
>>>> wrote:
>>>> >>
>>>> >> I agree with David that at least clearer log statements should be
>>>> added.
>>>> >>
>>>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>>>> existing flags (including many SDK-specific flags) would make it difficult
>>>> to implement. In addition, uniform argument names wouldn't necessarily
>>>> ensure uniform implementation.
>>>> >>
>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>> kcweaver@google.com
>>>> >>
>>>> >>
>>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>>>> >>>
>>>> >>> Java SDK creates one regional bucket per project and region
>>>> combination.
>>>> >>> So it's not a lot of buckets - no need to auto-clean.
>>>> >
>>>> >
>>>> > Agree that cleanup is not a big issue if we are only creating a
>>>> single bucket per project and region. I assume we are creating temporary
>>>> folders for each pipeline with the same region and project so that they
>>>> don't conflict (which we clean up).
>>>> > As others mentioned, we should clearly document this (including the
>>>> naming of the bucket) and produce a log during pipeline creation.
>>>> >
>>>> >>>
>>>> >>>
>>>> >>> I agree with Robert that having fewer flags is better.
>>>> >>> Perhaps what we need is a unifying interface for SDKs that simplifies
>>>> launching?
>>>> >>>
>>>> >>> So instead of:
>>>> >>> mvn compile exec:java -Dexec.mainClass=<class>
>>>> -Dexec.args="--runner=DataflowRunner --project=<project>
>>>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>>> >>> or
>>>> >>> python -m <module> --runner DataflowRunner --project <project>
>>>> --temp_location gs://<bucket>/tmp/ <user flags>
>>>> >
>>>> > Interesting, probably this should be extended to a generalized CLI
>>>> for Beam that can be easily installed to execute Beam pipelines ?
>>>>
>>>> This is starting to get somewhat off-topic from the original question,
>>>> but I'm not sure the benefits of providing a wrapper to the end user
>>>> would outweigh the costs of having to learn the wrapper. For Python
>>>> developers, python -m module, or even python -m path/to/script.py is
>>>> pretty standard. Java is a bit harder, because one needs to coordinate
>>>> a build as well, but I don't know how a "./beam java ..." script would
>>>> gloss over whether one is using maven, gradle, ant, or just has a pile
>>>> of pre-compiled jara (and would probably have to know a bit about the
>>>> project layout as well to invoke the right commands).
>>>>
>>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Pablo Estrada <pa...@google.com>.
Hi everyone,
I've gotten around to making this change, and Udi has been gracious enough
to review it[1].

I figured we have not fully answered the larger question of whether we
would truly like to make this change. Here are some thoughts giving me
pause:

1. Appropriate defaults - We are not sure we can select appropriate
defaults on behalf of users. (We are erroring out in case of KMS keys, but
how about other properties?)
2. Users have been using Beam's Python SDK the way it is for a long time
now: Supplying temp_location when running on Dataflow, without a problem.
3. This has billing implications that users may not be fully aware of.

The behavior in [1] matches the behavior of the Java SDK (create a bucket
when none is supplied AND running on Dataflow); but it still doesn't solve
the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this can be
done in a follow-up change using the Default Bucket functionality).

My bias in this case is: If it isn't broken, why fix it? I do not know of
anyone complaining about the required temp_location flag on Dataflow.

I think we can create a default bucket when dealing with BQ outside of
Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
What do others think?

Best
-P.

[1] https://github.com/apache/beam/pull/11982
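The fallback behavior described in this thread (use the supplied location; otherwise, on Dataflow, derive one default bucket per project and region; otherwise fail) can be sketched as pure logic. The "dataflow-staging" naming prefix below mirrors the Java SDK's one-bucket-per-project-and-region approach, but the exact prefix and helper names are illustrative, not the SDK's actual API:

```python
def default_bucket_name(project_number, region):
    # One bucket per (project, region) pair, as in the Java SDK's
    # GcpOptions; the "dataflow-staging" prefix is an assumption here.
    return "dataflow-staging-%s-%s" % (region.lower(), project_number)

def resolve_temp_location(temp_location, runner, project_number, region):
    # An explicitly supplied temp_location always wins.
    if temp_location:
        return temp_location
    # On Dataflow, fall back to an auto-created default bucket.
    if runner == "DataflowRunner":
        return "gs://%s/tmp" % default_bucket_name(project_number, region)
    # Other runners have no GCS default to fall back on.
    raise ValueError("--temp_location is required when using %s" % runner)
```

Because the bucket name is a pure function of project and region, repeated runs reuse the same bucket rather than accumulating new ones, which is why cleanup is a non-issue above.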

On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <al...@google.com> wrote:

> I agree with the benefits of auto-creating buckets from an ease-of-use
> perspective. My counterargument is that the auto-created buckets may not
> have the right settings for the users. A bucket has multiple settings, some
> required (name, storage class) and some optional (ACL policy, encryption,
> retention policy, labels). As the number of options increases, our chances
> of having a good enough default go down. For example, if a user wants to
> enable CMEK mode for encryption, they will enable it for their sources and
> sinks, and will instruct the Dataflow runner to encrypt its in-flight data.
> Creating a default (non-encrypted) temp bucket for this user would go
> against the user's intentions. We would not be able to create a bucket
> either, because we would not know what encryption keys to use for such a
> bucket. Our options would be to either not create a bucket at all, or fail
> if a temporary bucket was not specified and a CMEK mode is enabled.
>
> There is a similar issue with the region flag. If unspecified it defaults
> to us-central1. This is convenient for new users, but not making that flag
> required will expose a larger proportion of Dataflow users to events in
> that specific region.
>
> Robert's suggestion of having a flag for opt-in to a default set of GCP
> convenience flags sounds reasonable. At least users will explicitly
> acknowledge that certain things are auto managed for them.
>
> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:
>
>> Another idea would be to put default bucket preferences in a .beamrc file
>> so you don't have to remember to pass it every time (this could also
>> contain other default flag values).
>>
>
> IMO, the first question is whether auto-creation based on some
> unconfigurable defaults would happen or not. Once we agree on that, having
> an rc file vs flags vs supporting both would be a UX question.
>
>
>>
>>
>
>>
>>
>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>>> <ch...@google.com> wrote:
>>> >
>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com>
>>> wrote:
>>> >>
>>> >> I agree with David that at least clearer log statements should be
>>> added.
>>> >>
>>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>>> existing flags (including many SDK-specific flags) would make it difficult
>>> to implement. In addition, uniform argument names wouldn't necessarily
>>> ensure uniform implementation.
>>> >>
>>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>>> kcweaver@google.com
>>> >>
>>> >>
>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>>> >>>
>>> >>> Java SDK creates one regional bucket per project and region
>>> combination.
>>> >>> So it's not a lot of buckets - no need to auto-clean.
>>> >
>>> >
>>> > Agree that cleanup is not a big issue if we are only creating a single
>>> bucket per project and region. I assume we are creating temporary folders
>>> for each pipeline with the same region and project so that they don't
>>> conflict (which we clean up).
>>> > As others mentioned, we should clearly document this (including the
>>> naming of the bucket) and produce a log during pipeline creation.
>>> >
>>> >>>
>>> >>>
>>> >>> I agree with Robert that having fewer flags is better.
>>> >>> Perhaps what we need is a unifying interface for SDKs that simplifies
>>> launching?
>>> >>>
>>> >>> So instead of:
>>> >>> mvn compile exec:java -Dexec.mainClass=<class>
>>> -Dexec.args="--runner=DataflowRunner --project=<project>
>>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>> >>> or
>>> >>> python -m <module> --runner DataflowRunner --project <project>
>>> --temp_location gs://<bucket>/tmp/ <user flags>
>>> >
>>> > Interesting, probably this should be extended to a generalized CLI for
>>> Beam that can be easily installed to execute Beam pipelines ?
>>>
>>> This is starting to get somewhat off-topic from the original question,
>>> but I'm not sure the benefits of providing a wrapper to the end user
>>> would outweigh the costs of having to learn the wrapper. For Python
>>> developers, python -m module, or even python -m path/to/script.py is
>>> pretty standard. Java is a bit harder, because one needs to coordinate
>>> a build as well, but I don't know how a "./beam java ..." script would
>>> gloss over whether one is using maven, gradle, ant, or just has a pile
>>> of pre-compiled jars (and would probably have to know a bit about the
>>> project layout as well to invoke the right commands).
>>>
>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Ahmet Altay <al...@google.com>.
I agree with the benefits of auto-creating buckets from an ease-of-use
perspective. My counterargument is that the auto-created buckets may not
have the right settings for the users. A bucket has multiple settings, some
required (name, storage class) and some optional (ACL policy, encryption,
retention policy, labels). As the number of options increases, our chances
of having a good enough default go down. For example, if a user wants to
enable CMEK mode for encryption, they will enable it for their sources and
sinks, and will instruct the Dataflow runner to encrypt its in-flight data.
Creating a default (non-encrypted) temp bucket for this user would go
against the user's intentions. We would not be able to create a bucket
either, because we would not know what encryption keys to use for such a
bucket. Our options would be to either not create a bucket at all, or fail
if a temporary bucket was not specified and a CMEK mode is enabled.
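The second of those options (fail fast when CMEK is configured and no temp bucket is given) is also how the Python change mentioned elsewhere in this thread handles it. A minimal sketch of that guard, with purely hypothetical names:

```python
def resolve_temp_bucket(temp_location, kms_key=None):
    # Guard sketch: never auto-create a bucket when a customer-managed
    # encryption key (CMEK) is configured, because we cannot know which
    # key the auto-created bucket should use.
    if temp_location:
        return temp_location, False  # user-supplied, nothing created
    if kms_key:
        raise ValueError(
            "A temp bucket must be specified explicitly when a CMEK key "
            "is in use")
    # Safe to fall back to an auto-created default bucket; the bucket
    # name here is a placeholder, not what any SDK actually uses.
    return "gs://beam-default-temp", True
```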

There is a similar issue with the region flag. If unspecified it defaults
to us-central1. This is convenient for new users, but not making that flag
required will expose a larger proportion of Dataflow users to events in
that specific region.

Robert's suggestion of having a flag for opt-in to a default set of GCP
convenience flags sounds reasonable. At least users will explicitly
acknowledge that certain things are auto managed for them.

On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:

> Another idea would be to put default bucket preferences in a .beamrc file
> so you don't have to remember to pass it every time (this could also
> contain other default flag values).
>

IMO, the first question is whether auto-creation based on some
unconfigurable defaults would happen or not. Once we agree on that, having
an rc file vs flags vs supporting both would be a UX question.


>
>

>
>
> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>> <ch...@google.com> wrote:
>> >
>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com>
>> wrote:
>> >>
>> >> I agree with David that at least clearer log statements should be
>> added.
>> >>
>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>> existing flags (including many SDK-specific flags) would make it difficult
>> to implement. In addition, uniform argument names wouldn't necessarily
>> ensure uniform implementation.
>> >>
>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcweaver@google.com
>> >>
>> >>
>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>> >>>
>> >>> Java SDK creates one regional bucket per project and region
>> combination.
>> >>> So it's not a lot of buckets - no need to auto-clean.
>> >
>> >
>> > Agree that cleanup is not a big issue if we are only creating a single
>> bucket per project and region. I assume we are creating temporary folders
>> for each pipeline with the same region and project so that they don't
>> conflict (which we clean up).
>> > As others mentioned, we should clearly document this (including the
>> naming of the bucket) and produce a log during pipeline creation.
>> >
>> >>>
>> >>>
>> >>> I agree with Robert that having fewer flags is better.
>> >>> Perhaps what we need is a unifying interface for SDKs that simplifies
>> launching?
>> >>>
>> >>> So instead of:
>> >>> mvn compile exec:java -Dexec.mainClass=<class>
>> -Dexec.args="--runner=DataflowRunner --project=<project>
>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>> >>> or
>> >>> python -m <module> --runner DataflowRunner --project <project>
>> --temp_location gs://<bucket>/tmp/ <user flags>
>> >
>> > Interesting, probably this should be extended to a generalized CLI for
>> Beam that can be easily installed to execute Beam pipelines ?
>>
>> This is starting to get somewhat off-topic from the original question,
>> but I'm not sure the benefits of providing a wrapper to the end user
>> would outweigh the costs of having to learn the wrapper. For Python
>> developers, python -m module, or even python -m path/to/script.py is
>> pretty standard. Java is a bit harder, because one needs to coordinate
>> a build as well, but I don't know how a "./beam java ..." script would
>> gloss over whether one is using maven, gradle, ant, or just has a pile
>> of pre-compiled jars (and would probably have to know a bit about the
>> project layout as well to invoke the right commands).
>>
>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Udi Meiri <eh...@google.com>.
Another idea would be to put default bucket preferences in a .beamrc file
so you don't have to remember to pass it every time (this could also
contain other default flag values).
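A .beamrc could be as simple as an INI file whose values are merged under any explicitly passed flags. The sketch below assumes that hypothetical format and file name; no such feature exists today:

```python
import configparser
import os

def load_rc_defaults(path="~/.beamrc"):
    # Read default flag values from a [defaults] section. The format is
    # a guess at what .beamrc could look like, not an existing feature.
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(path))
    return dict(cfg["defaults"]) if cfg.has_section("defaults") else {}

def merge_flags(cli_flags, rc_defaults):
    # Flags given on the command line always override .beamrc defaults.
    merged = dict(rc_defaults)
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged
```

One design choice worth noting: keeping the precedence "explicit flag > rc file > built-in default" preserves today's behavior for anyone who already passes temp_location.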



On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <ro...@google.com> wrote:

> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
> <ch...@google.com> wrote:
> >
> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com> wrote:
> >>
> >> I agree with David that at least clearer log statements should be added.
> >>
> >> Udi, that's an interesting idea, but I imagine the sheer number of
> existing flags (including many SDK-specific flags) would make it difficult
> to implement. In addition, uniform argument names wouldn't necessarily
> ensure uniform implementation.
> >>
> >> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcweaver@google.com
> >>
> >>
> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
> >>>
> >>> Java SDK creates one regional bucket per project and region
> combination.
> >>> So it's not a lot of buckets - no need to auto-clean.
> >
> >
> > Agree that cleanup is not a big issue if we are only creating a single
> bucket per project and region. I assume we are creating temporary folders
> for each pipeline with the same region and project so that they don't
> conflict (which we clean up).
> > As others mentioned, we should clearly document this (including the
> naming of the bucket) and produce a log during pipeline creation.
> >
> >>>
> >>>
> >>> I agree with Robert that having fewer flags is better.
> >>> Perhaps what we need is a unifying interface for SDKs that simplifies
> launching?
> >>>
> >>> So instead of:
> >>> mvn compile exec:java -Dexec.mainClass=<class>
> -Dexec.args="--runner=DataflowRunner --project=<project>
> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
> >>> or
> >>> python -m <module> --runner DataflowRunner --project <project>
> --temp_location gs://<bucket>/tmp/ <user flags>
> >
> > Interesting, probably this should be extended to a generalized CLI for
> Beam that can be easily installed to execute Beam pipelines ?
>
> This is starting to get somewhat off-topic from the original question,
> but I'm not sure the benefits of providing a wrapper to the end user
> would outweigh the costs of having to learn the wrapper. For Python
> developers, python -m module, or even python -m path/to/script.py is
> pretty standard. Java is a bit harder, because one needs to coordinate
> a build as well, but I don't know how a "./beam java ..." script would
> gloss over whether one is using maven, gradle, ant, or just has a pile
> of pre-compiled jars (and would probably have to know a bit about the
> project layout as well to invoke the right commands).
>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Robert Bradshaw <ro...@google.com>.
On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
<ch...@google.com> wrote:
>
> On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com> wrote:
>>
>> I agree with David that at least clearer log statements should be added.
>>
>> Udi, that's an interesting idea, but I imagine the sheer number of existing flags (including many SDK-specific flags) would make it difficult to implement. In addition, uniform argument names wouldn't necessarily ensure uniform implementation.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>>
>>
>> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>>>
>>> Java SDK creates one regional bucket per project and region combination.
>>> So it's not a lot of buckets - no need to auto-clean.
>
>
> Agree that cleanup is not a big issue if we are only creating a single bucket per project and region. I assume we are creating temporary folders for each pipeline with the same region and project so that they don't conflict (which we clean up).
> As others mentioned, we should clearly document this (including the naming of the bucket) and produce a log during pipeline creation.
>
>>>
>>>
>>> I agree with Robert that having fewer flags is better.
>>> Perhaps what we need is a unifying interface for SDKs that simplifies launching?
>>>
>>> So instead of:
>>> mvn compile exec:java -Dexec.mainClass=<class> -Dexec.args="--runner=DataflowRunner --project=<project> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>> or
>>> python -m <module> --runner DataflowRunner --project <project> --temp_location gs://<bucket>/tmp/ <user flags>
>
> Interesting, probably this should be extended to a generalized CLI for Beam that can be easily installed to execute Beam pipelines ?

This is starting to get somewhat off-topic from the original question,
but I'm not sure the benefits of providing a wrapper to the end user
would outweigh the costs of having to learn the wrapper. For Python
developers, python -m module, or even python -m path/to/script.py is
pretty standard. Java is a bit harder, because one needs to coordinate
a build as well, but I don't know how a "./beam java ..." script would
gloss over whether one is using maven, gradle, ant, or just has a pile
of pre-compiled jars (and would probably have to know a bit about the
project layout as well to invoke the right commands).

Re: On Auto-creating GCS buckets on behalf of users

Posted by Chamikara Jayalath <ch...@google.com>.
On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kc...@google.com> wrote:

> I agree with David that at least clearer log statements should be added.
>
> Udi, that's an interesting idea, but I imagine the sheer number of
> existing flags (including many SDK-specific flags) would make it difficult
> to implement. In addition, uniform argument names wouldn't necessarily
> ensure uniform implementation.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com
>
>
> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>
>> Java SDK creates one regional bucket per project and region combination
>> <https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L316-L318>
>> .
>> So it's not a lot of buckets - no need to auto-clean.
>>
>
Agree that cleanup is not a big issue if we are only creating a single
bucket per project and region. I assume we are creating temporary folders
for each pipeline with the same region and project so that they don't
conflict (which we clean up).
As others mentioned, we should clearly document this (including the naming
of the bucket) and produce a log during pipeline creation.


>
>> I agree with Robert that having fewer flags is better.
>> Perhaps what we need is a unifying interface for SDKs that simplifies
>> launching?
>>
>> So instead of:
>> mvn compile exec:java -Dexec.mainClass=<class>
>> -Dexec.args="--runner=DataflowRunner --project=<project>
>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>> or
>> python -m <module> --runner DataflowRunner --project
>> <project> --temp_location gs://<bucket>/tmp/ <user flags>
>>
>
Interesting, probably this should be extended to a generalized CLI for Beam
that can be easily installed to execute Beam pipelines ?

Thanks,
Cham



>
>> We could have:
>> ./beam java run <class> --runner=DataflowRunner <user flags>
>> ./beam python run <module> --runner=DataflowRunner <user flags>
>>
>> where GCP project and temp_location are optional.
>>
>> On Tue, Jul 23, 2019 at 10:31 AM David Cavazos <dc...@google.com>
>> wrote:
>>
>>> I would go for #1 since it's a better user experience, especially for
>>> new users who don't understand every step involved in staging/deploying.
>>> It's just another (unnecessary) mental concept they don't have to be aware
>>> of. Anything that makes it closer to only providing the `--runner` flag
>>> without any additional flags (by default, but configurable if necessary) is
>>> a good thing in my opinion.
>>>
>>> AutoML already auto-creates a GCS bucket (not configurable, with a
>>> global name which has its own downfalls). Other products are already doing
>>> this to simplify user experience. I think as long as there's an explicit
>>> logging statement it should be fine.
>>>
>>> If the bucket was not specified and was created: "No --temp_location
>>> specified, created gs://..."
>>>
>>> If the bucket was not specified and was found: "No --temp_location
>>> specified, found gs://..."
>>>
>>> If the bucket was specified, the logging could be omitted since it's
>>> already explicit from the command line arguments.
>>>
>>> On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <
>>> chamikara@google.com> wrote:
>>>
>>>> Do we clean up auto created GCS buckets ?
>>>>
>>>> If there's no good way to cleanup, I think it might be better to make
>>>> this opt-in.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> I think having a single, default, auto-created temporary bucket per
>>>>> project for use in GCP (when running on Dataflow, or running elsewhere
>>>>> but using GCS such as for this BQ load files example), though not
>>>>> ideal, is the best user experience. If we don't want to be
>>>>> automatically creating such things for users by default, another
>>>>> option would be a single flag that opts-in to such auto-creation
>>>>> (which could include other resources in the future).
>>>>>
>>>>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pa...@google.com>
>>>>> wrote:
>>>>> >
>>>>> > Hello all,
>>>>> > I recently worked on a transform to load data into BigQuery by
>>>>> writing files to GCS, and issuing Load File jobs to BQ. I did this for the
>>>>> Python SDK[1].
>>>>> >
>>>>> > This option requires the user to provide a GCS bucket to write the
>>>>> files:
>>>>> >
>>>>> > If the user provides a bucket to the transform, the SDK will use
>>>>> that bucket.
>>>>> > If the user does not provide a bucket:
>>>>> >
>>>>> > When running in Dataflow, the SDK will borrow the temp_location of
>>>>> the pipeline.
>>>>> > When running in other runners, the pipeline will fail.
>>>>> >
>>>>> > The Java SDK has had functionality for File Loads into BQ for a long
>>>>> time; and particularly, when users do not provide a bucket, it attempts to
>>>>> create a default bucket[2]; and this bucket is used as temp_location (which
>>>>> then is used by the BQ File Loads transform).
>>>>> >
>>>>> > I do not really like creating GCS buckets on behalf of users. In
>>>>> Java, the outcome is that users will not have to pass a --tempLocation
>>>>> parameter when submitting jobs to Dataflow - which is a nice convenience,
>>>>> but I'm not sure that this is in-line with users' expectations.
>>>>> >
>>>>> > Currently, the options are:
>>>>> >
>>>>> > Adding support for bucket autocreation for Python SDK
>>>>> > Deprecating support for bucket autocreation in Java SDK, and
>>>>> printing a warning.
>>>>> >
>>>>> > I am personally inclined for #1. But what do others think?
>>>>> >
>>>>> > Best
>>>>> > -P.
>>>>> >
>>>>> > [1] https://github.com/apache/beam/pull/7892
>>>>> > [2]
>>>>> https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>>>>>
>>>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Kyle Weaver <kc...@google.com>.
I agree with David that at least clearer log statements should be added.

Udi, that's an interesting idea, but I imagine the sheer number of existing
flags (including many SDK-specific flags) would make it difficult to
implement. In addition, uniform argument names wouldn't necessarily ensure
uniform implementation.

Kyle Weaver | Software Engineer | github.com/ibzib | kcweaver@google.com


On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:

> Java SDK creates one regional bucket per project and region combination
> <https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L316-L318>
> .
> So it's not a lot of buckets - no need to auto-clean.
>
> I agree with Robert that having fewer flags is better.
> Perhaps what we need is a unifying interface for SDKs that simplifies
> launching?
>
> So instead of:
> mvn compile exec:java -Dexec.mainClass=<class>
> -Dexec.args="--runner=DataflowRunner --project=<project>
> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
> or
> python -m <module> --runner DataflowRunner --project
> <project> --temp_location gs://<bucket>/tmp/ <user flags>
>
> We could have:
> ./beam java run <class> --runner=DataflowRunner <user flags>
> ./beam python run <module> --runner=DataflowRunner <user flags>
>
> where GCP project and temp_location are optional.
>
> On Tue, Jul 23, 2019 at 10:31 AM David Cavazos <dc...@google.com>
> wrote:
>
>> I would go for #1 since it's a better user experience, especially for new
>> users who don't understand every step involved in staging/deploying. It's
>> just another (unnecessary) mental concept they don't have to be aware of.
>> Anything that makes it closer to only providing the `--runner` flag without
>> any additional flags (by default, but configurable if necessary) is a good
>> thing in my opinion.
>>
>> AutoML already auto-creates a GCS bucket (not configurable, with a global
>> name which has its own downfalls). Other products are already doing this to
>> simplify user experience. I think as long as there's an explicit logging
>> statement it should be fine.
>>
>> If the bucket was not specified and was created: "No --temp_location
>> specified, created gs://..."
>>
>> If the bucket was not specified and was found: "No --temp_location
>> specified, found gs://..."
>>
>> If the bucket was specified, the logging could be omitted since it's
>> already explicit from the command line arguments.
>>
>> On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> Do we clean up auto created GCS buckets ?
>>>
>>> If there's no good way to cleanup, I think it might be better to make
>>> this opt-in.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> I think having a single, default, auto-created temporary bucket per
>>>> project for use in GCP (when running on Dataflow, or running elsewhere
>>>> but using GCS such as for this BQ load files example), though not
>>>> ideal, is the best user experience. If we don't want to be
>>>> automatically creating such things for users by default, another
>>>> option would be a single flag that opts-in to such auto-creation
>>>> (which could include other resources in the future).
>>>>
>>>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pa...@google.com>
>>>> wrote:
>>>> >
>>>> > Hello all,
>>>> > I recently worked on a transform to load data into BigQuery by
>>>> writing files to GCS, and issuing Load File jobs to BQ. I did this for the
>>>> Python SDK[1].
>>>> >
>>>> > This option requires the user to provide a GCS bucket to write the
>>>> files:
>>>> >
>>>> > If the user provides a bucket to the transform, the SDK will use that
>>>> bucket.
>>>> > If the user does not provide a bucket:
>>>> >
>>>> > When running in Dataflow, the SDK will borrow the temp_location of
>>>> the pipeline.
>>>> > When running in other runners, the pipeline will fail.
>>>> >
>>>> > The Java SDK has had functionality for File Loads into BQ for a long
>>>> time; and particularly, when users do not provide a bucket, it attempts to
>>>> create a default bucket[2]; and this bucket is used as temp_location (which
>>>> then is used by the BQ File Loads transform).
>>>> >
>>>> > I do not really like creating GCS buckets on behalf of users. In
>>>> Java, the outcome is that users will not have to pass a --tempLocation
>>>> parameter when submitting jobs to Dataflow - which is a nice convenience,
>>>> but I'm not sure that this is in-line with users' expectations.
>>>> >
>>>> > Currently, the options are:
>>>> >
>>>> > Adding support for bucket autocreation for Python SDK
>>>> > Deprecating support for bucket autocreation in Java SDK, and printing
>>>> a warning.
>>>> >
>>>> > I am personally inclined for #1. But what do others think?
>>>> >
>>>> > Best
>>>> > -P.
>>>> >
>>>> > [1] https://github.com/apache/beam/pull/7892
>>>> > [2]
>>>> https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>>>>
>>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Valentyn Tymofieiev <va...@google.com>.
+1 to have a consistent experience across SDKs, and do bucket creation by
default, specifically:
- Temp locations should be optional.
- Autocreation behavior should be documented.
- The messages ("using bucket X", or "creating bucket X since temp_location
is not specified") should be visible in console logs.
- Meaning of temp_location, staging_location, gcpTempLocation,
awsTempLocation, and whether they are required or optional, should be
consistent across SDKs.
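One way to get that cross-SDK consistency would be to normalize the differently spelled flags to a single canonical name before interpreting them. The alias table below is purely illustrative and deliberately incomplete (for instance, it ignores the Java SDK subtlety that gcpTempLocation defaults from tempLocation rather than being a pure alias):

```python
# Map SDK-specific flag spellings onto one canonical name. This table is
# a sketch of the idea, not an inventory of real Beam flag semantics.
FLAG_ALIASES = {
    "tempLocation": "temp_location",
    "gcpTempLocation": "temp_location",
    "awsTempLocation": "temp_location",
    "stagingLocation": "staging_location",
}

def canonical_flags(flags):
    # Unknown flags pass through unchanged.
    return {FLAG_ALIASES.get(name, name): value
            for name, value in flags.items()}
```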


On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:

> Java SDK creates one regional bucket per project and region combination
> <https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L316-L318>
> .
> So it's not a lot of buckets - no need to auto-clean.
>
> I agree with Robert that having fewer flags is better.
> Perhaps what we need is a unifying interface for SDKs that simplifies
> launching?
>
> So instead of:
> mvn compile exec:java -Dexec.mainClass=<class>
> -Dexec.args="--runner=DataflowRunner --project=<project>
> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
> or
> python -m <module> --runner DataflowRunner --project
> <project> --temp_location gs://<bucket>/tmp/ <user flags>
>
> We could have:
> ./beam java run <class> --runner=DataflowRunner <user flags>
> ./beam python run <module> --runner=DataflowRunner <user flags>
>
> where GCP project and temp_location are optional.
>
> On Tue, Jul 23, 2019 at 10:31 AM David Cavazos <dc...@google.com>
> wrote:
>
>> I would go for #1 since it's a better user experience, especially for new
>> users who don't understand every step involved in staging/deploying. It's
>> just another (unnecessary) mental concept they don't have to be aware of.
>> Anything that makes it closer to only providing the `--runner` flag without
>> any additional flags (by default, but configurable if necessary) is a good
>> thing in my opinion.
>>
>> AutoML already auto-creates a GCS bucket (not configurable, with a global
>> name which has its own downfalls). Other products are already doing this to
>> simplify user experience. I think as long as there's an explicit logging
>> statement it should be fine.
>>
>> If the bucket was not specified and was created: "No --temp_location
>> specified, created gs://..."
>>
>> If the bucket was not specified and was found: "No --temp_location
>> specified, found gs://..."
>>
>> If the bucket was specified, the logging could be omitted since it's
>> already explicit from the command line arguments.
>>
>> On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> Do we clean up auto created GCS buckets ?
>>>
>>> If there's no good way to cleanup, I think it might be better to make
>>> this opt-in.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> I think having a single, default, auto-created temporary bucket per
>>>> project for use in GCP (when running on Dataflow, or running elsewhere
>>>> but using GCS such as for this BQ load files example), though not
>>>> ideal, is the best user experience. If we don't want to be
>>>> automatically creating such things for users by default, another
>>>> option would be a single flag that opts-in to such auto-creation
>>>> (which could include other resources in the future).
>>>>
>>>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pa...@google.com>
>>>> wrote:
>>>> >
>>>> > Hello all,
>>>> > I recently worked on a transform to load data into BigQuery by
>>>> writing files to GCS, and issuing Load File jobs to BQ. I did this for the
>>>> Python SDK[1].
>>>> >
>>>> > This option requires the user to provide a GCS bucket to write the
>>>> files:
>>>> >
>>>> > If the user provides a bucket to the transform, the SDK will use that
>>>> bucket.
>>>> > If the user does not provide a bucket:
>>>> >
>>>> > When running in Dataflow, the SDK will borrow the temp_location of
>>>> the pipeline.
>>>> > When running in other runners, the pipeline will fail.
>>>> >
>>>> > The Java SDK has had functionality for File Loads into BQ for a long
>>>> time; and particularly, when users do not provide a bucket, it attempts to
>>>> create a default bucket[2]; and this bucket is used as temp_location (which
>>>> then is used by the BQ File Loads transform).
>>>> >
>>>> > I do not really like creating GCS buckets on behalf of users. In
>>>> Java, the outcome is that users will not have to pass a --tempLocation
>>>> parameter when submitting jobs to Dataflow - which is a nice convenience,
>>>> but I'm not sure that this is in-line with users' expectations.
>>>> >
>>>> > Currently, the options are:
>>>> >
>>>> > Adding support for bucket autocreation for Python SDK
>>>> > Deprecating support for bucket autocreation in Java SDK, and printing
>>>> a warning.
>>>> >
>>>> > I am personally inclined for #1. But what do others think?
>>>> >
>>>> > Best
>>>> > -P.
>>>> >
>>>> > [1] https://github.com/apache/beam/pull/7892
>>>> > [2]
>>>> https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>>>>
>>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Udi Meiri <eh...@google.com>.
Java SDK creates one regional bucket per project and region combination
<https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L316-L318>
.
So it's not a lot of buckets - no need to auto-clean.

I agree with Robert that having fewer flags is better.
Perhaps what we need is a unifying interface for SDKs that simplifies
launching?

So instead of:
mvn compile exec:java -Dexec.mainClass=<class>
-Dexec.args="--runner=DataflowRunner --project=<project>
--gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
or
python -m <module> --runner DataflowRunner --project
<project> --temp_location gs://<bucket>/tmp/ <user flags>

We could have:
./beam java run <class> --runner=DataflowRunner <user flags>
./beam python run <module> --runner=DataflowRunner <user flags>

where GCP project and temp_location are optional.
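The per-project, per-region scheme keeps the bucket count bounded and deterministic: a second run in the same project and region finds the same bucket instead of creating a new one. A minimal sketch of that idea (the exact name format below is illustrative, not necessarily what the Java SDK emits):

```python
def default_bucket_name(project_number: int, region: str) -> str:
    # One bucket per (project, region) pair: the name is a pure function
    # of both, so repeated runs in the same project and region reuse it.
    return "dataflow-staging-%s-%d" % (region.lower(), project_number)
```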

On Tue, Jul 23, 2019 at 10:31 AM David Cavazos <dc...@google.com> wrote:

> I would go for #1 since it's a better user experience. Especially for new
> users who don't understand every step involved on staging/deploying. It's
> just another (unnecessary) mental concept they don't have to be aware of.
> Anything that makes it closer to only providing the `--runner` flag without
> any additional flags (by default, but configurable if necessary) is a good
> thing in my opinion.
>
> AutoML already auto-creates a GCS bucket (not configurable, with a global
> name which has its own downfalls). Other products are already doing this to
> simplify user experience. I think as long as there's an explicit logging
> statement it should be fine.
>
> If the bucket was not specified and was created: "No --temp_location
> specified, created gs://..."
>
> If the bucket was not specified and was found: "No --temp_location
> specified, found gs://..."
>
> If the bucket was specified, the logging could be omitted since it's
> already explicit from the command line arguments.
>
> On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> Do we clean up auto created GCS buckets ?
>>
>> If there's no good way to cleanup, I think it might be better to make
>> this opt-in.
>>
>> Thanks,
>> Cham
>>
>> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> I think having a single, default, auto-created temporary bucket per
>>> project for use in GCP (when running on Dataflow, or running elsewhere
>>> but using GCS such as for this BQ load files example), though not
>>> ideal, is the best user experience. If we don't want to be
>>> automatically creating such things for users by default, another
>>> option would be a single flag that opts-in to such auto-creation
>>> (which could include other resources in the future).
>>>
>>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pa...@google.com>
>>> wrote:
>>> >
>>> > Hello all,
>>> > I recently worked on a transform to load data into BigQuery by writing
>>> files to GCS, and issuing Load File jobs to BQ. I did this for the Python
>>> SDK[1].
>>> >
>>> > This option requires the user to provide a GCS bucket to write the
>>> files:
>>> >
>>> > If the user provides a bucket to the transform, the SDK will use that
>>> bucket.
>>> > If the user does not provide a bucket:
>>> >
>>> > When running in Dataflow, the SDK will borrow the temp_location of the
>>> pipeline.
>>> > When running in other runners, the pipeline will fail.
>>> >
>>> > The Java SDK has had functionality for File Loads into BQ for a long
>>> time; and particularly, when users do not provide a bucket, it attempts to
>>> create a default bucket[2]; and this bucket is used as temp_location (which
>>> then is used by the BQ File Loads transform).
>>> >
>>> > I do not really like creating GCS buckets on behalf of users. In Java,
>>> the outcome is that users will not have to pass a --tempLocation parameter
>>> when submitting jobs to Dataflow - which is a nice convenience, but I'm not
>>> sure that this is in-line with users' expectations.
>>> >
>>> > Currently, the options are:
>>> >
>>> > Adding support for bucket autocreation for Python SDK
>>> > Deprecating support for bucket autocreation in Java SDK, and printing
>>> a warning.
>>> >
>>> > I am personally inclined for #1. But what do others think?
>>> >
>>> > Best
>>> > -P.
>>> >
>>> > [1] https://github.com/apache/beam/pull/7892
>>> > [2]
>>> https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>>>
>>

Re: On Auto-creating GCS buckets on behalf of users

Posted by David Cavazos <dc...@google.com>.
I would go for #1 since it's a better user experience. Especially for new
users who don't understand every step involved in staging/deploying. It's
just another (unnecessary) mental concept they don't have to be aware of.
Anything that makes it closer to only providing the `--runner` flag without
any additional flags (by default, but configurable if necessary) is a good
thing in my opinion.

AutoML already auto-creates a GCS bucket (not configurable, with a global
name, which has its own drawbacks). Other products are already doing this to
simplify user experience. I think as long as there's an explicit logging
statement it should be fine.

If the bucket was not specified and was created: "No --temp_location
specified, created gs://..."

If the bucket was not specified and was found: "No --temp_location
specified, found gs://..."

If the bucket was specified, the logging could be omitted since it's
already explicit from the command line arguments.
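The get-or-create policy with the two log messages proposed above can be sketched as follows. The lookup and create operations are injected so the policy is testable without GCS access; in practice they would wrap a GCS client (e.g. google-cloud-storage's lookup_bucket/create_bucket):

```python
import logging

def get_or_create_temp_bucket(name, lookup, create):
    """Resolve a temp bucket, logging as proposed in this thread.

    `lookup(name)` returns the bucket or None; `create(name)` creates it.
    """
    bucket = lookup(name)
    if bucket is not None:
        logging.info("No --temp_location specified, found gs://%s", name)
        return bucket
    bucket = create(name)
    logging.info("No --temp_location specified, created gs://%s", name)
    return bucket
```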

On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <ch...@google.com>
wrote:

> Do we clean up auto created GCS buckets ?
>
> If there's no good way to cleanup, I think it might be better to make this
> opt-in.
>
> Thanks,
> Cham
>
> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> I think having a single, default, auto-created temporary bucket per
>> project for use in GCP (when running on Dataflow, or running elsewhere
>> but using GCS such as for this BQ load files example), though not
>> ideal, is the best user experience. If we don't want to be
>> automatically creating such things for users by default, another
>> option would be a single flag that opts-in to such auto-creation
>> (which could include other resources in the future).
>>
>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pa...@google.com> wrote:
>> >
>> > Hello all,
>> > I recently worked on a transform to load data into BigQuery by writing
>> files to GCS, and issuing Load File jobs to BQ. I did this for the Python
>> SDK[1].
>> >
>> > This option requires the user to provide a GCS bucket to write the
>> files:
>> >
>> > If the user provides a bucket to the transform, the SDK will use that
>> bucket.
>> > If the user does not provide a bucket:
>> >
>> > When running in Dataflow, the SDK will borrow the temp_location of the
>> pipeline.
>> > When running in other runners, the pipeline will fail.
>> >
>> > The Java SDK has had functionality for File Loads into BQ for a long
>> time; and particularly, when users do not provide a bucket, it attempts to
>> create a default bucket[2]; and this bucket is used as temp_location (which
>> then is used by the BQ File Loads transform).
>> >
>> > I do not really like creating GCS buckets on behalf of users. In Java,
>> the outcome is that users will not have to pass a --tempLocation parameter
>> when submitting jobs to Dataflow - which is a nice convenience, but I'm not
>> sure that this is in-line with users' expectations.
>> >
>> > Currently, the options are:
>> >
>> > Adding support for bucket autocreation for Python SDK
>> > Deprecating support for bucket autocreation in Java SDK, and printing a
>> warning.
>> >
>> > I am personally inclined for #1. But what do others think?
>> >
>> > Best
>> > -P.
>> >
>> > [1] https://github.com/apache/beam/pull/7892
>> > [2]
>> https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>>
>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Chamikara Jayalath <ch...@google.com>.
Do we clean up auto-created GCS buckets?

If there's no good way to clean up, I think it might be better to make this
opt-in.

Thanks,
Cham
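One way to sidestep explicit cleanup is to attach an object lifecycle rule to the bucket at creation time, so temp files expire on their own. A sketch of such a rule (field names follow the GCS JSON API's bucket `lifecycle` resource; the one-day TTL is an assumption, not anything the SDKs do today):

```python
import json

# Lifecycle policy deleting objects older than one day, in the shape the
# GCS JSON API expects under a bucket's "lifecycle" field.
TEMP_BUCKET_LIFECYCLE = {
    "rule": [
        {"action": {"type": "Delete"}, "condition": {"age": 1}},
    ]
}

print(json.dumps(TEMP_BUCKET_LIFECYCLE, sort_keys=True))
```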

On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <ro...@google.com> wrote:

> I think having a single, default, auto-created temporary bucket per
> project for use in GCP (when running on Dataflow, or running elsewhere
> but using GCS such as for this BQ load files example), though not
> ideal, is the best user experience. If we don't want to be
> automatically creating such things for users by default, another
> option would be a single flag that opts-in to such auto-creation
> (which could include other resources in the future).
>
> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pa...@google.com> wrote:
> >
> > Hello all,
> > I recently worked on a transform to load data into BigQuery by writing
> files to GCS, and issuing Load File jobs to BQ. I did this for the Python
> SDK[1].
> >
> > This option requires the user to provide a GCS bucket to write the files:
> >
> > If the user provides a bucket to the transform, the SDK will use that
> bucket.
> > If the user does not provide a bucket:
> >
> > When running in Dataflow, the SDK will borrow the temp_location of the
> pipeline.
> > When running in other runners, the pipeline will fail.
> >
> > The Java SDK has had functionality for File Loads into BQ for a long
> time; and particularly, when users do not provide a bucket, it attempts to
> create a default bucket[2]; and this bucket is used as temp_location (which
> then is used by the BQ File Loads transform).
> >
> > I do not really like creating GCS buckets on behalf of users. In Java,
> the outcome is that users will not have to pass a --tempLocation parameter
> when submitting jobs to Dataflow - which is a nice convenience, but I'm not
> sure that this is in-line with users' expectations.
> >
> > Currently, the options are:
> >
> > Adding support for bucket autocreation for Python SDK
> > Deprecating support for bucket autocreation in Java SDK, and printing a
> warning.
> >
> > I am personally inclined for #1. But what do others think?
> >
> > Best
> > -P.
> >
> > [1] https://github.com/apache/beam/pull/7892
> > [2]
> https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>

Re: On Auto-creating GCS buckets on behalf of users

Posted by Robert Bradshaw <ro...@google.com>.
I think having a single, default, auto-created temporary bucket per
project for use in GCP (when running on Dataflow, or running elsewhere
but using GCS such as for this BQ load files example), though not
ideal, is the best user experience. If we don't want to be
automatically creating such things for users by default, another
option would be a single flag that opts-in to such auto-creation
(which could include other resources in the future).
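The single opt-in flag suggested above could look like the following; the flag name `--auto_create_gcp_resources` is hypothetical, not an existing Beam pipeline option:

```python
import argparse

def add_gcp_setup_options(parser: argparse.ArgumentParser) -> None:
    # Off by default, so nothing is created unless the user asks for it.
    parser.add_argument(
        "--auto_create_gcp_resources",
        action="store_true",
        default=False,
        help="Allow the SDK to create missing GCP resources, e.g. a "
             "default temp GCS bucket, on the user's behalf.")

parser = argparse.ArgumentParser()
add_gcp_setup_options(parser)
```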

On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pa...@google.com> wrote:
>
> Hello all,
> I recently worked on a transform to load data into BigQuery by writing files to GCS, and issuing Load File jobs to BQ. I did this for the Python SDK[1].
>
> This option requires the user to provide a GCS bucket to write the files:
>
> If the user provides a bucket to the transform, the SDK will use that bucket.
> If the user does not provide a bucket:
>
> When running in Dataflow, the SDK will borrow the temp_location of the pipeline.
> When running in other runners, the pipeline will fail.
>
> The Java SDK has had functionality for File Loads into BQ for a long time; and particularly, when users do not provide a bucket, it attempts to create a default bucket[2]; and this bucket is used as temp_location (which then is used by the BQ File Loads transform).
>
> I do not really like creating GCS buckets on behalf of users. In Java, the outcome is that users will not have to pass a --tempLocation parameter when submitting jobs to Dataflow - which is a nice convenience, but I'm not sure that this is in-line with users' expectations.
>
> Currently, the options are:
>
> Adding support for bucket autocreation for Python SDK
> Deprecating support for bucket autocreation in Java SDK, and printing a warning.
>
> I am personally inclined for #1. But what do others think?
>
> Best
> -P.
>
> [1] https://github.com/apache/beam/pull/7892
> [2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
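For reference, the fallback order described at the top of the thread (explicit bucket, then Dataflow's temp_location, then fail) can be sketched as follows; the function and parameter names are illustrative, not the SDK's:

```python
def resolve_gcs_temp_dir(user_bucket, runner, pipeline_temp_location):
    """Pick a GCS location for BQ file loads, per the thread's rules.

    1. An explicitly provided bucket wins.
    2. Otherwise, on Dataflow, borrow the pipeline's temp_location.
    3. Otherwise, fail: file loads need somewhere to stage files.
    """
    if user_bucket:
        return user_bucket
    if runner == "DataflowRunner" and pipeline_temp_location:
        return pipeline_temp_location
    raise ValueError(
        "BigQuery file loads require a GCS location: pass a bucket to "
        "the transform or set --temp_location when running on Dataflow.")
```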