Posted to issues@spark.apache.org by "Martin Studer (JIRA)" <ji...@apache.org> on 2019/04/03 09:26:00 UTC

[jira] [Commented] (SPARK-19860) DataFrame join gets a conflict error if two frames have a column with the same name.

    [ https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808530#comment-16808530 ] 

Martin Studer commented on SPARK-19860:
---------------------------------------

We're observing the same issue with PySpark 2.3.0. It happens on an inner join of two data frames that have a single column in common (the join column). If I rename one of the columns as mentioned by [~wuchang1989] and then use a join expression, the join succeeds. Isolating the problem is difficult because it only occurs in the context of a larger pipeline.
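
For reference, a minimal sketch of that workaround, borrowing the fdate column from the report below rather than the actual columns from our pipeline:

{code}
# Rename the shared join column on one side so the two frames no longer
# carry the same attribute, then join with an explicit expression.
df2_renamed = df2.withColumnRenamed('fdate', 'fdate2')
joined = df1.join(df2_renamed, df1.fdate == df2_renamed.fdate2, 'inner')

# Drop the now-duplicated key column if it is not needed downstream.
joined = joined.drop('fdate2')
{code}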

> DataFrame join gets a conflict error if two frames have a column with the same name.
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-19860
>                 URL: https://issues.apache.org/jira/browse/SPARK-19860
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: wuchang
>            Priority: Major
>
> {code}
> >>> print df1.collect()
> [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), Row(fdate=u'20170301', in_amount1=10159653)]
> >>> print df2.collect()
> [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), Row(fdate=u'20170301', in_amount2=9475418)]
> >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
> 2017-03-08 10:27:34,357 WARN  [Thread-2] sql.Column: Constructing trivially true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
>     jdf = self._jdf.join(other._jdf, on._jc, how)
>   File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
>   File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"
> Failure when resolving conflicting references in Join:
> 'Join Inner, (fdate#42 = fdate#42)
> :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount1#97]
> :  +- Filter (inorout#44 = A)
> :     +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
> :        +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
> :           +- SubqueryAlias history_transfer_v
> :              +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51]
> :                 +- SubqueryAlias history_transfer
> :                    +- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet
> +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount2#145]
>    +- Filter (inorout#44 = B)
>       +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42]
>          +- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
>             +- SubqueryAlias history_transfer_v
>                +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51]
>                   +- SubqueryAlias history_transfer
>                      +- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet
>
> Conflicting attributes: fdate#42
> {code}
> Only when I use the .withColumnRenamed method (e.g. .withColumnRenamed('fdate', 'fdate2')) to rename df1's column fdate to fdate1 and df2's column fdate to fdate2 does the join succeed.
> So my question is: why does this conflict happen?
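
For context on that question: in the plan above, both branches resolve fdate to the same attribute (fdate#42) because df1 and df2 are derived from the same source relation, which is what the "trivially true equals predicate" warning points at. A minimal sketch of the usual ways around this, assuming the df1/df2 frames from the report (not verified against 2.1.0):

{code}
from pyspark.sql.functions import col

# Join on the column name: Spark resolves the key on each side
# separately and keeps a single fdate column in the output.
joined = df1.join(df2, 'fdate', 'inner')

# Or alias both sides so the two fdate attributes become distinguishable
# in the join expression.
a, b = df1.alias('a'), df2.alias('b')
joined = a.join(b, col('a.fdate') == col('b.fdate'), 'inner')
{code}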


