Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/03/27 03:13:00 UTC

[jira] [Resolved] (SPARK-31186) toPandas fails on simple query (collect() works)

     [ https://issues.apache.org/jira/browse/SPARK-31186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-31186.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 28025
[https://github.com/apache/spark/pull/28025]

> toPandas fails on simple query (collect() works)
> ------------------------------------------------
>
>                 Key: SPARK-31186
>                 URL: https://issues.apache.org/jira/browse/SPARK-31186
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.4
>            Reporter: Michael Chirico
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> My pandas version is 0.25.1.
> I ran the following simple code (cross joins are enabled):
> {code:python}
> spark.sql('''
> select t1.*, t2.* from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').toPandas()
> {code}
> and got a ValueError from pandas:
> > ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
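> The ValueError itself is easy to reproduce in pure pandas once the column names collide; my guess (I haven't checked the Spark source) is that toPandas() hits something like this internally:
> {code:python}
> import pandas as pd
>
> # With a duplicated label, selecting "v" returns a DataFrame, not a Series
> pdf = pd.DataFrame([[1, None], [2, 3]], columns=["v", "v"])
> print(type(pdf["v"]))  # <class 'pandas.core.frame.DataFrame'>
>
> # .isnull().any() on that DataFrame yields a Series (one bool per column),
> # and truth-testing a Series raises exactly the error above
> try:
>     if pdf["v"].isnull().any():
>         pass
> except ValueError as e:
>     print(e)  # The truth value of a Series is ambiguous. ...
> {code}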
> collect() works fine:
> {code:python}
> spark.sql('''
> select * from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').collect()
> # [Row(v=1, v=1),
> #  Row(v=1, v=2),
> #  Row(v=1, v=3),
> #  Row(v=2, v=1),
> #  Row(v=2, v=2),
> #  Row(v=2, v=3),
> #  Row(v=3, v=1),
> #  Row(v=3, v=2),
> #  Row(v=3, v=3)]
> {code}
> I imagine it's related to the duplicate column names, but this doesn't fail:
> {code:python}
> spark.sql("select 1 v, 1 v").toPandas()
> # v	v
> # 0	1	1
> {code}
> Also no issue for multiple rows:
> {code:python}
> spark.sql("select 1 v, 1 v union all select 1 v, 2 v").toPandas()
> {code}
> It also works when I skip the cross join and use a janky, programmatically-generated union all query:
> {code:python}
> cond = []
> for ii in range(3):
>     for jj in range(3):
>         cond.append(f'select {ii+1} v, {jj+1} v')
> spark.sql(' union all '.join(cond)).toPandas()
> {code}
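> One difference I do notice is nullability: all the working examples above build the columns from literals, which are non-nullable, while the left join marks t2's column nullable. If toPandas() does some null-aware handling only for nullable integer columns, that could explain why only the join query trips over the duplicate names. This is speculation, but the schemas at least differ:
> {code:python}
> spark.sql("select 1 v, 1 v").schema
> # StructType(List(StructField(v,IntegerType,false),StructField(v,IntegerType,false)))
>
> spark.sql('''
> select t1.*, t2.* from (
>   select explode(sequence(1, 3)) v
> ) t1 left join (
>   select explode(sequence(1, 3)) v
> ) t2
> ''').schema
> # the left join makes t2's v nullable:
> # StructType(List(StructField(v,IntegerType,false),StructField(v,IntegerType,true)))
> {code}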
> As near as I can tell, the output is identical to the explode output, which makes this issue all the more peculiar. I had thought toPandas() was applied to the output of collect(), so if collect() gives the same rows in both cases, how can toPandas() fail for one and not the other? The lazy DataFrame is also the same in both cases: DataFrame[v: int, v: int]. I must be missing something.
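> (My mental model, which may be wrong: without Arrow, toPandas() is roughly collect() plus a per-column dtype-correction pass, along these lines:)
> {code:python}
> import pandas as pd
>
> # Hypothetical sketch of a non-Arrow toPandas(); not the actual Spark source.
> def to_pandas_sketch(df):
>     pdf = pd.DataFrame.from_records(df.collect(), columns=df.columns)
>     for field in df.schema:
>         # A per-column null check like this returns a Series rather than a
>         # single bool when field.name is duplicated, and truth-testing it
>         # raises the ValueError from this report.
>         if field.nullable and pdf[field.name].isnull().any():
>             continue  # leave nullable int columns as-is (NaN-safe)
>         # ...otherwise this is where a dtype coercion would happen
>     return pdf
> {code}
> If that's right, collect() succeeding while toPandas() fails is at least consistent: the failure would be in the pandas post-processing, not in fetching the rows.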



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org