Posted to dev@beam.apache.org by Tatiana Al-Chueyr Martins <ta...@gmail.com> on 2021/06/15 09:02:08 UTC

[Bug] Python Cookbook coders.py example raising PicklingError

Hello,

When trying to run the coders.py cookbook example [1], it raises an
exception

```
   PicklingError : _pickle.PicklingError: Can't pickle <class
'__main__.JsonCoder'>: it's not the same object as __main__.JsonCoder
[while running 'read/Read/Map(<lambda at iobase.py:899>)']
```

The example pipeline works when it's run using the coders_test.py file [2].
In the existing test, the DAG creates the input collection instead of
reading from a file.

I am using Python 3.7.9, the Python Beam SDK 2.29.0 and DirectRunner, under
Darwin.

To reproduce the issue, run:
```
  python coders.py --input input.ndjson --output output.txt
```

Where the input file (input.ndjson) has the same values as coders_test.py:
```
{"host": ["Germany", 1], "guest": ["Italy", 0]}
{"host": ["Germany", 1], "guest": ["Brasil", 3]}
{"host": ["Brasil", 1], "guest": ["Italy", 0]}
```
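
For context, here is roughly what the example's JsonCoder looks like and how it is wired into the pipeline. This is a paraphrased sketch, not the exact code from [1], and the file names are placeholders:

```
import json

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText


class JsonCoder(object):
  """Interprets each text line as a JSON object (sketch of the cookbook coder)."""

  def encode(self, x):
    return json.dumps(x).encode('utf-8')

  def decode(self, x):
    return json.loads(x)


# The example passes the coder to the read and write transforms, roughly:
with beam.Pipeline() as p:
  (p
   | 'read' >> ReadFromText('input.ndjson', coder=JsonCoder())
   | 'write' >> WriteToText('output.txt', coder=JsonCoder()))
```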

The full stack trace is at the bottom of the email.

Questions:
1. Am I doing something wrong?
2. Is a coder recommended to decode NDJSON files or is it advised to just
use ParDos / Maps to deserialize them?

I really appreciate any help you can provide.

Kind regards,

Tatiana

[1]
https://github.com/apache/beam/blob/35bac6a62f1dc548ee908cfeff7f73ffcac38e6f/sdks/python/apache_beam/examples/cookbook/coders.py
[2]
https://github.com/apache/beam/blob/35bac6a62f1dc548ee908cfeff7f73ffcac38e6f/sdks/python/apache_beam/examples/cookbook/coders_test.py


```
Traceback (most recent call last):

 File "apache_beam/runners/common.py", line 1233, in
apache_beam.runners.common.DoFnRunner.process

 File "apache_beam/runners/common.py", line 581, in
apache_beam.runners.common.SimpleInvoker.invoke_process

 File "apache_beam/runners/common.py", line 1395, in
apache_beam.runners.common._OutputProcessor.process_outputs

 File "apache_beam/runners/worker/operations.py", line 219, in
apache_beam.runners.worker.operations.SingletonConsumerSet.receive

 File "apache_beam/runners/worker/operations.py", line 183, in
apache_beam.runners.worker.operations.ConsumerSet.update_counters_start

 File "apache_beam/runners/worker/opcounters.py", line 217, in
apache_beam.runners.worker.opcounters.OperationCounters.update_from

 File "apache_beam/runners/worker/opcounters.py", line 255, in
apache_beam.runners.worker.opcounters.OperationCounters.do_sample

 File "apache_beam/coders/coder_impl.py", line 1311, in
apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables

 File "apache_beam/coders/coder_impl.py", line 1322, in
apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables

 File "apache_beam/coders/coder_impl.py", line 354, in
apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.get_estimated_size_and_observables

 File "apache_beam/coders/coder_impl.py", line 418, in
apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream

 File "apache_beam/coders/coder_impl.py", line 261, in
apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream

 File
"/private/tmp/dataflow_sandbox/lib/python3.7/site-packages/apache_beam/coders/coders.py",
line 749, in <lambda>

  lambda x: dumps(x, protocol), pickle.loads)

_pickle.PicklingError: Can't pickle <class '__main__.JsonCoder'>: it's not
the same object as __main__.JsonCoder
```

Re: [Bug] Python Cookbook coders.py example raising PicklingError

Posted by Valentyn Tymofieiev <va...@google.com>.
Hi Tatiana,

Thanks for the detailed description of this unusual behavior. Responses inline.

Best,
Valentyn

On Tue, Jun 15, 2021 at 2:02 AM Tatiana Al-Chueyr Martins <
tatiana.alchueyr@gmail.com> wrote:

> Hello,
>
> When trying to run the coders.py cookbook example [1], it raises an
> exception
>
> ```
>    PicklingError : _pickle.PicklingError: Can't pickle <class
> '__main__.JsonCoder'>: it's not the same object as __main__.JsonCoder
> [while running 'read/Read/Map(<lambda at iobase.py:899>)']
> ```
>
This is a bug. I opened https://issues.apache.org/jira/browse/BEAM-12573. I
tracked the bug down to https://github.com/apache/beam/pull/13154, but
didn't have time to investigate further. I will also be out of office for
several weeks, so if anybody is interested in looking further, feel free to
pick this up.

The example also needs https://github.com/apache/beam/pull/15125 to pass. I
also filed https://issues.apache.org/jira/browse/BEAM-12572 to make sure we
can catch such errors early.


>
> The example pipeline works when it's run using the coders_test.py file
> [2]. In the existing test, the DAG creates the input collection instead of
> reading from a file.
>
> I am using Python 3.7.9, the Python Beam SDK 2.29.0 and DirectRunner,
> under Darwin.
>
> To reproduce the issue, run:
> ```
>   python coders.py --input input.ndjson --output output.txt
> ```
>
> Where the input file (input.ndjson) has the same values as coders_test.py:
> ```
> {"host": ["Germany", 1], "guest": ["Italy", 0]}
> {"host": ["Germany", 1], "guest": ["Brasil", 3]}
> {"host": ["Brasil", 1], "guest": ["Italy", 0]}
> ```
>
> The full stack trace is at the bottom of the email.
>
> Questions:
> 1. Am I doing something wrong?
>
Thanks for reporting this bug.

> 2. Is a coder recommended to decode NDJSON files or is it advised to just
> use ParDos / Maps to deserialize them?
>
I think the example illustrates using a custom coder, which in this case
can be a convenient way to structure your pipeline, but you could also
serialize/deserialize the JSON objects yourself with a ParDo/Map.
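
For example, something along these lines would work without defining a
custom coder (a minimal sketch; the file names and transform labels are
just placeholders):

```
import json

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText

with beam.Pipeline() as p:
  (p
   | 'read' >> ReadFromText('input.ndjson')  # each element is one line of text
   | 'parse' >> beam.Map(json.loads)         # deserialize each line into a dict
   # ... transforms over plain dicts go here ...
   | 'format' >> beam.Map(json.dumps)        # serialize back to JSON strings
   | 'write' >> WriteToText('output.txt'))
```

This keeps the (de)serialization as an explicit step in the pipeline rather
than hiding it in the IO coder.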
