Posted to issues@beam.apache.org by "Kyle Weaver (Jira)" <ji...@apache.org> on 2020/05/08 19:13:00 UTC

[jira] [Issue Comment Deleted] (BEAM-9835) test_multimap_multiside_input failing on Spark Python

     [ https://issues.apache.org/jira/browse/BEAM-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Weaver updated BEAM-9835:
------------------------------
    Comment: was deleted

(was: I am leaving this as a starter task for an incoming intern.

The failure can be reproduced by running the following command in your local Beam repo:

./gradlew :sdks:python:test-suites:portable:py2:sparkValidatesRunner -Ptests="test_multimap_multiside_input"

This test uses the same PCollection as a side input more than once. It fails because the Spark portable runner keys its broadcasts [1] by PCollection ID, so each additional use of the PCollection registers another broadcast under the same key. Since PCollections are immutable, a PCollection only needs to be broadcast once, no matter how many times it is used as a side input.
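
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not Beam or Spark code: StrictMapBuilder, broadcast_side_inputs_naive, and broadcast_side_inputs_deduped are hypothetical names, and the "Broadcast(n)" strings are stand-ins for real broadcast handles. The sketch assumes the runner collects broadcasts into a map that rejects duplicate keys, which is what the error message in the issue description suggests.

```python
# Hypothetical model of the bug; none of these names exist in Beam or Spark.


class StrictMapBuilder(object):
    """Collects entries and rejects duplicate keys when built."""

    def __init__(self):
        self._entries = []

    def put(self, key, value):
        self._entries.append((key, value))
        return self

    def build(self):
        result = {}
        for key, value in self._entries:
            if key in result:
                # Same shape as the error reported in the issue.
                raise ValueError(
                    "Multiple entries with same key: %s=%s and %s=%s"
                    % (key, value, key, result[key]))
            result[key] = value
        return result


def broadcast_side_inputs_naive(side_input_pcollection_ids):
    """Creates one broadcast per side-input use, keyed by PCollection ID."""
    builder = StrictMapBuilder()
    for n, pcoll_id in enumerate(side_input_pcollection_ids):
        builder.put(pcoll_id, "Broadcast(%d)" % n)
    return builder.build()  # Raises if a PCollection is used twice.


def broadcast_side_inputs_deduped(side_input_pcollection_ids):
    """Creates one broadcast per distinct PCollection ID."""
    broadcasts = {}
    for pcoll_id in side_input_pcollection_ids:
        if pcoll_id not in broadcasts:
            broadcasts[pcoll_id] = "Broadcast(%d)" % len(broadcasts)
    return broadcasts
```

With the same ID passed twice, the naive version raises, while the deduplicated version returns a single entry that every use of the side input can share. A real fix would have to apply the same idea wherever the runner creates its broadcasts.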

Extra credit: does this same bug affect the classic Spark runner? If so, that should be fixed as well.

[1] https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/broadcast/Broadcast.html)

> test_multimap_multiside_input failing on Spark Python
> -----------------------------------------------------
>
>                 Key: BEAM-9835
>                 URL: https://issues.apache.org/jira/browse/BEAM-9835
>             Project: Beam
>          Issue Type: Bug
>          Components: test-failures
>            Reporter: Kyle Weaver
>            Priority: Major
>
> beam_PostCommit_Python_VR_Spark is red.
> 18:32:46 ERROR: test_multimap_multiside_input (__main__.SparkRunnerTest)
> 18:32:46 ----------------------------------------------------------------------
> 18:32:46 Traceback (most recent call last):
> 18:32:46   File "apache_beam/runners/portability/fn_api_runner/fn_runner_test.py", line 265, in test_multimap_multiside_input
> 18:32:46     equal_to([('a', [1, 3], [1, 2, 3]), ('b', [2], [1, 2, 3])]))
> 18:32:46   File "apache_beam/pipeline.py", line 529, in __exit__
> 18:32:46     self.run().wait_until_finish()
> 18:32:46   File "apache_beam/runners/portability/portable_runner.py", line 571, in wait_until_finish
> 18:32:46     (self._job_id, self._state, self._last_error_message()))
> 18:32:46 RuntimeError: Pipeline test_multimap_multiside_input_1588026700.62_3808162b-fc6a-4eb0-be3a-3efd819560f7 failed in state FAILED: java.lang.IllegalArgumentException: Multiple entries with same key: ref_PCollection_PCollection_21=(Broadcast(37),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder)) and ref_PCollection_PCollection_21=(Broadcast(36),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder))