Posted to user@beam.apache.org by Jeff Klukas <jk...@mozilla.com> on 2019/08/30 15:13:10 UTC

Setting environment and system properties on Dataflow workers

I just spent the past two days debugging a character corruption issue in a
Dataflow pipeline. It turned out that we had encoded a JSON object to a
string and then called getBytes() without specifying a charset. In our
testing infrastructure, this didn't cause a problem because the system's
default charset was UTF-8. Whatever the default charset is on Dataflow
workers, it is apparently not UTF-8.

The main lesson here is to be very careful to always specify a charset
when encoding and decoding strings. But it would be nice to protect
ourselves from this problem in the future.

Is there any way for users to specify environment variables and/or Java
system properties when deploying a pipeline to Dataflow such that those
settings are in effect on all workers? I'd like to ensure UTF-8 is the
default charset throughout the pipeline on any system.
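For anyone hitting the same pitfall, here's a minimal standalone Java sketch of the failure mode described above and the explicit-charset fix (class and string contents are illustrative, not from the actual pipeline):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetPitfall {
    public static void main(String[] args) {
        String json = "{\"name\": \"Zoë\"}";

        // Pitfall: uses the JVM's default charset, so the bytes produced
        // differ between a UTF-8 laptop and a worker with another default.
        byte[] ambiguous = json.getBytes();

        // Fix: always name the charset; identical bytes on every machine.
        byte[] utf8 = json.getBytes(StandardCharsets.UTF_8);

        // Round-tripping with an explicit charset is lossless.
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(json)); // prints true

        System.out.println("default-charset bytes match UTF-8 bytes: "
            + Arrays.equals(ambiguous, utf8)); // true only if default is UTF-8
    }
}
```

The same applies to the `new String(byte[])` and `InputStreamReader`/`OutputStreamWriter` constructors that have charset-less overloads.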

Re: Setting environment and system properties on Dataflow workers

Posted by Lukasz Cwik <lc...@google.com>.
Thanks for the feedback. Most people have been interested in how to solve
their business use case, and most of the current effort has gone into
documenting how to implement common patterns for those use cases (such as a
slowly changing side input in a streaming pipeline), so examples and
documentation for specific flags or hooks have taken second place to those
requests.


Re: Setting environment and system properties on Dataflow workers

Posted by Jeff Klukas <jk...@mozilla.com>.
Thanks so much for the links. I expect those will get me where I need to be.

That said, I don't know how a user would discover JvmInitializer without
asking on the list. Does this seem worth adding to the Beam programming
guide? It seems potentially too far down in the weeds for that guide and
it's also mostly Dataflow-specific. I'd love to see more documented on the
Dataflow worker environment in Google's docs, so perhaps that would be the
best place for this.


Re: Setting environment and system properties on Dataflow workers

Posted by Lukasz Cwik <lc...@google.com>.
There is a way to run arbitrary code on JVM startup via a JVM
initializer[1] in the Dataflow worker and in the portable Java worker as
well.

You should be able to mutate system properties at that point in time, since
Java allows system properties to be mutated at runtime. The standard Java
runtime doesn't provide hooks to edit environment variables, though; you
have to resort to some hackery that is JVM-version dependent[2].

1:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/harness/JvmInitializer.java
2: https://blog.sebastian-daschner.com/entries/changing_env_java
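A rough sketch of what such an initializer could look like, assuming the Beam Java SDK's JvmInitializer interface from [1] and Google AutoService on the classpath so the worker can discover the class via ServiceLoader (the class name and property choice are illustrative):

```java
package com.example;

import com.google.auto.service.AutoService;
import org.apache.beam.sdk.harness.JvmInitializer;
import org.apache.beam.sdk.options.PipelineOptions;

// Registers this initializer for ServiceLoader discovery on the worker.
@AutoService(JvmInitializer.class)
public class Utf8DefaultsInitializer implements JvmInitializer {

  @Override
  public void onStartup() {
    // Runs early during worker JVM startup; system properties can be
    // mutated here. Caveat: file.encoding is read very early by the JVM,
    // so setting it here may not change Charset.defaultCharset() on all
    // JVM versions. Treat this as a best-effort guard and still pass an
    // explicit charset (e.g. StandardCharsets.UTF_8) in pipeline code.
    System.setProperty("file.encoding", "UTF-8");
  }

  @Override
  public void beforeProcessing(PipelineOptions options) {
    // Invoked before the worker begins processing bundles; pipeline
    // options are available here if configuration needs them.
  }
}
```

Since this runs on every worker JVM, it's also a reasonable place to fail fast, for example by throwing in onStartup() if Charset.defaultCharset() is not UTF-8.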
