Posted to user@beam.apache.org by Ramana Venkata <vr...@kisi.io> on 2023/06/15 09:03:02 UTC

[Question] Change default file encoding in Dataflow runners

Hi,

I accidentally discovered that the default file encoding in my Dataflow
runners is ANSI_X3.4-1968. We expected it to be UTF-8, and as a result,
some of our data has been corrupted.
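
For readers following along: ANSI_X3.4-1968 is an alias of US-ASCII, so text encoded with that default loses every character outside the 7-bit range. The following is a minimal sketch of that failure mode in plain Java, without Beam; the class and method names are hypothetical and only illustrate the effect, not the actual pipeline code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiDefaultDemo {
    // Encodes text with the given charset and decodes it back with the same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Look up the charset by the name reported in file.encoding.
        Charset ansi = Charset.forName("ANSI_X3.4-1968");
        System.out.println("canonical name: " + ansi.name());

        String original = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Characters that US-ASCII cannot represent are replaced with '?'.
        System.out.println("ASCII round trip keeps text: "
            + roundTrip(original, ansi).equals(original));

        // UTF-8 can represent the character, so the round trip is lossless.
        System.out.println("UTF-8 round trip keeps text: "
            + roundTrip(original, StandardCharsets.UTF_8).equals(original));
    }
}
```

The replacement with '?' is silent, which is why this kind of corruption tends to be noticed only after the data has been written.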

I came across this Stack Overflow answer (link:
https://stackoverflow.com/a/362006), but to the best of my knowledge, there
is no way to pass flags to the Java command in Dataflow runners.

I would appreciate your assistance in resolving this issue.

Let me know if you have any further questions!

-- 

Venkata Ramana

Senior Software Engineer

Kisi Inc, 45 Main Street, Suite 608, Brooklyn, NY 11201
<https://maps.google.com/?q=45+Main+Street,+Suite+723,+%C2%A0Brooklyn,+NY+11201&entry=gmail&source=g>

www.getkisi.com
<http://getkisi.com/?utm_source=email&utm_medium=email&utm_campaign=email>

-- 
---
This email is confidential/privileged. If you're not the intended 
recipient, please delete it and notify us immediately; please do not 
copy/use/disclose it for any purpose, to anyone. Thank you!

Re: [Question] Change default file encoding in Dataflow runners

Posted by Bruno Volpato via user <us...@beam.apache.org>.
Hi Ramana,

Interesting -- I see it too when not using Runner v2.
Runner v2 shows UTF-8 as expected, but without it, I get ANSI_X3.4-1968 for
file.encoding.

I'd say this is probably unintended, but we'd need to look further.
I'm curious why it would cause data corruption, though. Are you relying
on Charset.defaultCharset() for text operations?
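
If the code does depend on the platform default, one way to make it independent of file.encoding is to name the charset at every point where bytes become text or text becomes bytes. This is a minimal sketch in plain Java with hypothetical helper names; where such conversions actually happen in the affected pipeline is an assumption.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Decodes bytes as UTF-8 regardless of the JVM's file.encoding.
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encodes text as UTF-8 regardless of the JVM's file.encoding.
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Z\u00fcrich"; // contains one non-ASCII character

        // new String(bytes) and text.getBytes() without a charset argument
        // would fall back to Charset.defaultCharset() instead.
        System.out.println(decode(encode(text)).equals(text));
    }
}
```

The same idea applies to readers and writers: constructors and factory methods that accept a Charset argument avoid the default entirely.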

Best,
Bruno




Re: [Question] Change default file encoding in Dataflow runners

Posted by Ramana Venkata <vr...@kisi.io>.
Hi Bruno,

I added a log statement in a DoFn:
logger.info(System.getProperty("file.encoding"))
and it showed ANSI_X3.4-1968 as the file encoding. Nothing in our code
sets that encoding. I will check with Google Support.
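
A slightly fuller diagnostic than the single property may help when comparing workers. The sketch below is plain Java with hypothetical names; in a pipeline, the returned string would be passed to the logger from inside the DoFn. Including LANG and LC_ALL is an assumption that the worker's locale environment is relevant here.

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Builds a one-line summary of the JVM's encoding-related settings.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name()
            + ", LANG=" + System.getenv("LANG")
            + ", LC_ALL=" + System.getenv("LC_ALL");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Unset environment variables show up as "null" in the output, which is itself useful information when the locale is suspected.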




Re: [Question] Change default file encoding in Dataflow runners

Posted by Bruno Volpato via user <us...@beam.apache.org>.
Hi Ramana,

Curious where you got ANSI_X3.4-1968 from -- I don't think there's any
trace of this encoding anywhere in Dataflow Workers (as far as I am aware
and from what I could find).
The default encoding for the JVM in the worker image is UTF-8, and Dataflow
doesn't appear to set it anywhere. I was able to check using:

$ docker run -it --entrypoint '/bin/bash' \
    us-central1-artifactregistry.gcr.io/google.com/dataflow-containers/worker/v1beta3/beam_java11_sdk:2.48.0

# jshell

> System.getProperty("file.encoding");
$1 ==> "UTF-8"


If you can't figure out why your job is using ANSI_X3.4-1968, I'd suggest
contacting Google support and providing the relevant job IDs so this can be
looked at further.

Best,
Bruno


