Posted to issues@beam.apache.org by "Jérémie Bigras-Dunberry (Jira)" <ji...@apache.org> on 2021/07/31 20:08:00 UTC
[jira] [Created] (BEAM-12701) Converting two dataframe to_csv in the same pipeline causes PCollection label collision
Jérémie Bigras-Dunberry created BEAM-12701:
----------------------------------------------
Summary: Converting two dataframe to_csv in the same pipeline causes PCollection label collision
Key: BEAM-12701
URL: https://issues.apache.org/jira/browse/BEAM-12701
Project: Beam
Issue Type: Bug
Components: io-py-common
Affects Versions: 2.31.0
Reporter: Jérémie Bigras-Dunberry
If you call to_csv on a DeferredDataFrame twice in a single pipeline, like this:
{code:java}
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

df1 = pd.DataFrame.from_records({"a":"b"}, index=[0])
df2 = pd.DataFrame.from_records({"a":"b"}, index=[0])
with beam.Pipeline() as p:
    df1 = to_dataframe(to_pcollection(df1, pipeline=p), label="df1")
    df2 = to_dataframe(to_pcollection(df2, pipeline=p), label="df2")
    df1.to_csv("test.csv")
    df2.to_csv("test2.csv"){code}
You get this error on the second to_csv call:
{code:java}
RuntimeError: A transform with label "ToPCollection(df)" already exists in the pipeline. To apply a transform with a specified label write pvalue | "label" >> transform
{code}
I think this happens because to_csv calls to_pcollection without a label, so the same label, "ToPCollection(df)", is inferred for both to_csv calls.
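To illustrate the suspected cause, here is a toy model of the label check, written without Beam. ToyPipeline, to_csv_unlabeled and to_csv_labeled are hypothetical names made up for this sketch, not Beam APIs, and deriving the label from the output path is only one possible fix. I have not checked how Beam actually infers the label internally.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline that requires unique transform labels."""

    def __init__(self):
        self.labels = set()

    def apply(self, transform_name, label=None):
        # With no explicit label, the label is inferred from the transform alone.
        full_label = label if label is not None else transform_name
        if full_label in self.labels:
            raise RuntimeError(
                'A transform with label "%s" already exists in the pipeline.'
                % full_label
            )
        self.labels.add(full_label)
        return full_label


def to_csv_unlabeled(pipeline, path):
    # Models the suspected behaviour: the inner conversion is applied with no label.
    return pipeline.apply("ToPCollection(df)")


def to_csv_labeled(pipeline, path):
    # Assumed fix: make the label unique per call, here by including the path.
    return pipeline.apply("ToPCollection(df)", label="ToPCollection(df) to %s" % path)


p = ToyPipeline()
to_csv_unlabeled(p, "test.csv")
try:
    to_csv_unlabeled(p, "test2.csv")
    collided = False
except RuntimeError:
    collided = True  # the second call reuses the inferred label

q = ToyPipeline()
to_csv_labeled(q, "test.csv")
to_csv_labeled(q, "test2.csv")  # distinct labels, so no error is raised
```

In this model the second unlabeled call fails in the same way as the report above, while the labeled variant goes through because each call contributes a distinct label.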