You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Kyle Weaver (Jira)" <ji...@apache.org> on 2020/05/08 18:56:00 UTC
[jira] [Commented] (BEAM-9835) test_multimap_multiside_input
failing on Spark Python
[ https://issues.apache.org/jira/browse/BEAM-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102843#comment-17102843 ]
Kyle Weaver commented on BEAM-9835:
-----------------------------------
I am leaving this as a starter task for an incoming intern.
The failure can be reproduced by running the following command in your local Beam repo:
./gradlew :sdks:python:test-suites:portable:py2:sparkValidatesRunner -Ptests="test_multimap_multiside_input"
This test uses the same PCollection as a side input multiple times. The reason this test fails is that, since the Spark portable runner keys broadcasts [1] by PCollection ID, we end up with duplicate keys. Since PCollections are immutable, it is only necessary to broadcast a PCollection once, no matter how many times it is used as a side input.
Extra credit: does this same bug affect the classic Spark runner? If so, that should be fixed as well.
[1] https://spark.apache.org/docs/2.4.5/api/java/org/apache/spark/broadcast/Broadcast.html
> test_multimap_multiside_input failing on Spark Python
> -----------------------------------------------------
>
> Key: BEAM-9835
> URL: https://issues.apache.org/jira/browse/BEAM-9835
> Project: Beam
> Issue Type: Bug
> Components: test-failures
> Reporter: Kyle Weaver
> Priority: Major
>
> beam_PostCommit_Python_VR_Spark is red.
> 18:32:46 ERROR: test_multimap_multiside_input (__main__.SparkRunnerTest)
> 18:32:46 ----------------------------------------------------------------------
> 18:32:46 Traceback (most recent call last):
> 18:32:46 File "apache_beam/runners/portability/fn_api_runner/fn_runner_test.py", line 265, in test_multimap_multiside_input
> 18:32:46 equal_to([('a', [1, 3], [1, 2, 3]), ('b', [2], [1, 2, 3])]))
> 18:32:46 File "apache_beam/pipeline.py", line 529, in __exit__
> 18:32:46 self.run().wait_until_finish()
> 18:32:46 File "apache_beam/runners/portability/portable_runner.py", line 571, in wait_until_finish
> 18:32:46 (self._job_id, self._state, self._last_error_message()))
> 18:32:46 RuntimeError: Pipeline test_multimap_multiside_input_1588026700.62_3808162b-fc6a-4eb0-be3a-3efd819560f7 failed in state FAILED: java.lang.IllegalArgumentException: Multiple entries with same key: ref_PCollection_PCollection_21=(Broadcast(37),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder)) and ref_PCollection_PCollection_21=(Broadcast(36),WindowedValue$FullWindowedValueCoder(KvCoder(ByteArrayCoder,VarLongCoder),GlobalWindow$Coder))
--
This message was sent by Atlassian Jira
(v8.3.4#803005)