Posted to issues@beam.apache.org by "Jérémie Bigras-Dunberry (Jira)" <ji...@apache.org> on 2021/07/31 20:08:00 UTC

[jira] [Created] (BEAM-12701) Converting two dataframe to_csv in the same pipeline causes PCollection label collision

Jérémie Bigras-Dunberry created BEAM-12701:
----------------------------------------------

             Summary: Converting two dataframe to_csv in the same pipeline causes PCollection label collision
                 Key: BEAM-12701
                 URL: https://issues.apache.org/jira/browse/BEAM-12701
             Project: Beam
          Issue Type: Bug
          Components: io-py-common
    Affects Versions: 2.31.0
            Reporter: Jérémie Bigras-Dunberry


 

If you call to_csv on a DeferredDataFrame twice in a single pipeline, like this:
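{code:python}
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

df1 = pd.DataFrame.from_records({"a": "b"}, index=[0])
df2 = pd.DataFrame.from_records({"a": "b"}, index=[0])

with beam.Pipeline() as p:
    # The to_dataframe calls are given explicit, distinct labels.
    df1 = to_dataframe(to_pcollection(df1, pipeline=p), label="df1")
    df2 = to_dataframe(to_pcollection(df2, pipeline=p), label="df2")

    df1.to_csv("test.csv")
    df2.to_csv("test2.csv")  # the second call fails
{code}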
You get this error on the second to_csv call:

 
{code:java}
RuntimeError: A transform with label "ToPCollection(df)" already exists in the pipeline. To apply a transform with a specified label write pvalue | "label" >> transform

{code}

I think this happens because to_csv calls to_pcollection internally without passing a label, so the same label is inferred for both to_csv calls.
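
To illustrate the suspected mechanism, here is a small toy model in plain Python. It does not use Beam and is not Beam's implementation: ToyPipeline, toy_to_pcollection, toy_to_csv and the unique_label flag are all hypothetical names made up for this sketch. It only assumes that the pipeline rejects duplicate labels and that a missing label is inferred from the local variable name inside the helper.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline that requires unique labels."""

    def __init__(self):
        self.labels = set()

    def apply(self, label):
        # Reject any label that has already been registered.
        if label in self.labels:
            raise RuntimeError(
                'A transform with label "%s" already exists in the pipeline.'
                % label)
        self.labels.add(label)
        return label


def toy_to_pcollection(pipeline, var_name, label=None):
    # With no explicit label, fall back to one derived from the variable name.
    if label is None:
        label = "ToPCollection(%s)" % var_name
    return pipeline.apply(label)


def toy_to_csv(pipeline, path, unique_label=False):
    # Inside this helper the frame is always bound to the same local name,
    # "df", so the inferred label is identical on every call.
    label = "ToPCollection(df, %s)" % path if unique_label else None
    return toy_to_pcollection(pipeline, "df", label=label)


def second_call_error(unique_label):
    """Return the error message from the second toy_to_csv call, or None."""
    p = ToyPipeline()
    toy_to_csv(p, "test.csv", unique_label)
    try:
        toy_to_csv(p, "test2.csv", unique_label)
    except RuntimeError as exc:
        return str(exc)
    return None
```

In this model, second_call_error(False) reproduces the duplicate-label message, while second_call_error(True) returns None because each call passes a label that includes the output path. Whether a path-based label is the right fix for Beam itself is an open question; the sketch only shows why an unlabeled internal call would collide.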
