Posted to issues@spark.apache.org by "Saurabh (JIRA)" <ji...@apache.org> on 2017/05/29 07:53:04 UTC

[jira] [Commented] (SPARK-7182) [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

    [ https://issues.apache.org/jira/browse/SPARK-7182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028113#comment-16028113 ] 

Saurabh commented on SPARK-7182:
--------------------------------

Xiao Li, I'm getting the same error when I try .dropDuplicates(). Changing the alias isn't a workable solution: with 40 columns in each DataFrame it is nearly impossible to rename every one by hand. So I tried dropDuplicates() instead, but it throws the same u"Reference 'xcol14x' is ambiguous" exception.
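A sketch of a workaround for the many-columns case (hedged: the Spark calls are shown as comments and assume PySpark 1.4+, where DataFrame.withColumnRenamed and equi-joins on a list of column names are available; `prefix_columns` is a hypothetical helper, not a Spark API). Instead of renaming 40 columns by hand, build the rename mapping programmatically; alternatively, joining on a list of column names makes Spark keep a single copy of each join key, avoiding the ambiguity entirely. Note also that in PySpark a compound join condition must use `&` with parentheses, e.g. `(t1.a == t2.a) & (t1.b == t2.b)` -- Python's `and` does not combine Column expressions.

{code}
def prefix_columns(columns, prefix):
    # Return a mapping old_name -> prefixed_name for every column,
    # so the rename loop below needs no hand-written names.
    return {c: prefix + c for c in columns}

# Pure-Python part, runnable anywhere:
mapping = prefix_columns(["a", "b", "c"], "t1_")
# mapping == {"a": "t1_a", "b": "t1_b", "c": "t1_c"}

# PySpark part (illustrative; needs a SQLContext/SparkSession):
# for old, new in prefix_columns(t1.columns, "t1_").items():
#     t1 = t1.withColumnRenamed(old, new)
#
# Or sidestep duplicates by joining on the column names themselves:
# j = t1.join(t2, ["a", "b"])   # result has one 'a' and one 'b'
{code}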


> [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7182
>                 URL: https://issues.apache.org/jira/browse/SPARK-7182
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Don Drake
>
> I'm having trouble saving a dataframe as parquet after performing a simple table join.
> Below is a trivial example that demonstrates the issue.
> The following is from a pyspark session:
> {code}
> d1=[{'a':1, 'b':2, 'c':3}]
> d2=[{'a':1, 'b':2, 'd':4}]
> t1 = sqlContext.createDataFrame(d1)
> t2 = sqlContext.createDataFrame(d2)
> j = t1.join(t2, t1.a==t2.a and t1.b==t2.b)
> >>> j
> DataFrame[a: bigint, b: bigint, c: bigint, a: bigint, b: bigint, d: bigint]
> {code}
> Try to get a unique list of the columns:
> {code}
> u = sorted(list(set(j.columns)))
> >>> nt = j.select(*u)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py", line 586, in select
>     jdf = self._jdf.select(self.sql_ctx._sc._jvm.PythonUtils.toSeq(jcols))
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o829.select.
> : org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#0L, a#3L.;
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:229)
> {code}
> Since the select didn't work, save the file instead (that works), but reading it back in fails:
> {code}
> j.saveAsParquetFile('j')
> >>> z = sqlContext.parquetFile('j')
> >>> z.take(1)
> ...
> : An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 104.0 failed 1 times, most recent failure: Lost task 171.0 in stage 104.0 (TID 1235, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/drake/fd/spark/j/part-r-00172.parquet
> 	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org