Posted to user@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2018/01/15 10:09:49 UTC

Re: Inner join with the table itself

Hi Michael,

-dev +user

What's the query? How do you "fool spark"?

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <ms...@gmail.com>
wrote:

> Hi all,
>
> If I try joining the table with itself using join columns, I am
> getting the following error:
> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
> allow cartesian products between these relations.;"
>
> This is not true: my join is not trivial and is not a real cross
> join. I am providing a join condition and expect to get maybe a couple
> of joined rows for each row in the original table.
>
> There is a workaround for this, which involves renaming all the columns
> in the source data frame and only afterwards proceeding with the join.
> This allows us to fool Spark.
>
> Now I am wondering: is there a better way to get rid of this problem?
> I do not like the idea of renaming the columns, because it makes it
> really difficult to keep track of the column names in the result data
> frames.
> Is it possible to deactivate this check?
>
> Thanks,
> Michael
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Inner join with the table itself

Posted by Michael Shtelma <ms...@gmail.com>.
Hi Jacek,

Thank you for the workaround.
It really works this way:

pos.as("p1").join(pos.as("p2")).filter($"p1.POSITION_ID0" === $"p2.POSITION_ID")

I have checked that this way I get the same execution plan as for
the join with the renamed columns.
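
For completeness, the whole pattern spelled out (a sketch only; the
POSITION_ID/POSITION_ID0 columns come from our schema, and the final
select with the MATCHED_POSITION_ID alias is just an illustration of
how to disambiguate the output):

import spark.implicits._

val joined = pos.as("p1")
  .join(pos.as("p2"))
  .filter($"p1.POSITION_ID0" === $"p2.POSITION_ID")
  // pick columns explicitly via the aliases to avoid ambiguous names
  .select($"p1.POSITION_ID", $"p2.POSITION_ID".as("MATCHED_POSITION_ID"))

joined.explain()  // should match the plan of the rename-based workaround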

Best,
Michael


On Mon, Jan 15, 2018 at 10:33 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi Michael,
>
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
>
> scala> val r1 = spark.range(1)
> r1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
>
> scala> r1.as("left").join(r1.as("right")).filter($"left.id" ===
> $"right.id").show
> +---+---+
> | id| id|
> +---+---+
> |  0|  0|
> +---+---+
>
> Am I missing something? When aliasing a Dataset, use the alias
> identifier inside the column references.
>
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Mon, Jan 15, 2018 at 3:26 PM, Michael Shtelma <ms...@gmail.com> wrote:
>>
>> Hi Jacek & Gengliang,
>>
>> let's take a look at the following query:
>>
>> val pos = spark.read.parquet(prefix + "POSITION.parquet")
>> pos.createOrReplaceTempView("POSITION")
>> spark.sql("SELECT  POSITION.POSITION_ID  FROM POSITION POSITION JOIN
>> POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID
>> ").collect()
>>
>> This query works for me right now on Spark 2.2.
>>
>> Now we can try implementing the same logic with the DataFrame API:
>>
>> pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()
>>
>> I am getting the following error:
>>
>> "Join condition is missing or trivial.
>>
>> Use the CROSS JOIN syntax to allow cartesian products between these
>> relations.;"
>>
>> I have tried using the alias function, but without success:
>>
>> val pos2 = pos.alias("P2")
>> pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()
>>
>> This also leads to the same error.
>> Am I missing something about the usage of alias?
>>
>> Now let's rename the columns:
>>
>> val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
>> pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()
>>
>> It works!
>>
>> There is one more really odd thing about all this: a colleague of mine
>> managed to get the same exception ("Join condition is missing or
>> trivial") with the original SQL query as well, but I think he was using
>> empty tables.
>>
>> Thanks,
>> Michael
>>
>>
>> On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang
>> <ge...@databricks.com> wrote:
>> > Hi Michael,
>> >
>> > You can use `EXPLAIN` to see how your query is optimized:
>> > https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
>> > I believe your query is an actual cross join, which is usually very
>> > slow to execute.
>> >
>> > To get rid of this, you can set `spark.sql.crossJoin.enabled` to true.
>> >
>> >
>> > On Jan 15, 2018, at 6:09 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>> >
>> > Hi Michael,
>> >
>> > -dev +user
>> >
>> > What's the query? How do you "fool spark"?
>> >
>> > Pozdrawiam,
>> > Jacek Laskowski
>> > ----
>> > https://about.me/JacekLaskowski
>> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>> > On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <ms...@gmail.com>
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> If I try joining the table with itself using join columns, I am
>> >> getting the following error:
>> >> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
>> >> allow cartesian products between these relations.;"
>> >>
>> >> This is not true: my join is not trivial and is not a real cross
>> >> join. I am providing a join condition and expect to get maybe a couple
>> >> of joined rows for each row in the original table.
>> >>
>> >> There is a workaround for this, which involves renaming all the columns
>> >> in the source data frame and only afterwards proceeding with the join.
>> >> This allows us to fool Spark.
>> >>
>> >> Now I am wondering: is there a better way to get rid of this problem?
>> >> I do not like the idea of renaming the columns, because it makes it
>> >> really difficult to keep track of the column names in the result data
>> >> frames.
>> >> Is it possible to deactivate this check?
>> >>
>> >> Thanks,
>> >> Michael
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>
>> >
>> >
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Inner join with the table itself

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi Michael,

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> val r1 = spark.range(1)
r1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> r1.as("left").join(r1.as("right")).filter($"left.id" === $"right.id
").show
+---+---+
| id| id|
+---+---+
|  0|  0|
+---+---+

Am I missing something? When aliasing a Dataset, use the alias
identifier inside the column references.
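
The condition can also go straight into join instead of a separate
filter (a sketch against the same r1; for an inner join the optimizer
should fold the filter above into the join condition anyway, so both
forms ought to produce the same plan):

r1.as("left").join(r1.as("right"), $"left.id" === $"right.id").show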


Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jan 15, 2018 at 3:26 PM, Michael Shtelma <ms...@gmail.com> wrote:

> Hi Jacek & Gengliang,
>
> let's take a look at the following query:
>
> val pos = spark.read.parquet(prefix + "POSITION.parquet")
> pos.createOrReplaceTempView("POSITION")
> spark.sql("SELECT  POSITION.POSITION_ID  FROM POSITION POSITION JOIN
> POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID
> ").collect()
>
> This query works for me right now on Spark 2.2.
>
> Now we can try implementing the same logic with the DataFrame API:
>
> pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()
>
> I am getting the following error:
>
> "Join condition is missing or trivial.
>
> Use the CROSS JOIN syntax to allow cartesian products between these
> relations.;"
>
> I have tried using the alias function, but without success:
>
> val pos2 = pos.alias("P2")
> pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()
>
> This also leads to the same error.
> Am I missing something about the usage of alias?
>
> Now let's rename the columns:
>
> val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
> pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()
>
> It works!
>
> There is one more really odd thing about all this: a colleague of mine
> managed to get the same exception ("Join condition is missing or
> trivial") with the original SQL query as well, but I think he was using
> empty tables.
>
> Thanks,
> Michael
>
>
> On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang
> <ge...@databricks.com> wrote:
> > Hi Michael,
> >
> > You can use `EXPLAIN` to see how your query is optimized:
> > https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
> > I believe your query is an actual cross join, which is usually very
> > slow to execute.
> >
> > To get rid of this, you can set `spark.sql.crossJoin.enabled` to true.
> >
> >
> > On Jan 15, 2018, at 6:09 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> >
> > Hi Michael,
> >
> > -dev +user
> >
> > What's the query? How do you "fool spark"?
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > ----
> > https://about.me/JacekLaskowski
> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> > Follow me at https://twitter.com/jaceklaskowski
> >
> > On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <ms...@gmail.com>
> > wrote:
> >>
> >> Hi all,
> >>
> >> If I try joining the table with itself using join columns, I am
> >> getting the following error:
> >> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
> >> allow cartesian products between these relations.;"
> >>
> >> This is not true: my join is not trivial and is not a real cross
> >> join. I am providing a join condition and expect to get maybe a couple
> >> of joined rows for each row in the original table.
> >>
> >> There is a workaround for this, which involves renaming all the columns
> >> in the source data frame and only afterwards proceeding with the join.
> >> This allows us to fool Spark.
> >>
> >> Now I am wondering: is there a better way to get rid of this problem?
> >> I do not like the idea of renaming the columns, because it makes it
> >> really difficult to keep track of the column names in the result data
> >> frames.
> >> Is it possible to deactivate this check?
> >>
> >> Thanks,
> >> Michael
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>
> >
> >
>

Re: Inner join with the table itself

Posted by Michael Shtelma <ms...@gmail.com>.
Hi Jacek & Gengliang,

let's take a look at the following query:

val pos = spark.read.parquet(prefix + "POSITION.parquet")
pos.createOrReplaceTempView("POSITION")
spark.sql("SELECT  POSITION.POSITION_ID  FROM POSITION POSITION JOIN
POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID
").collect()

This query works for me right now on Spark 2.2.

Now we can try implementing the same logic with the DataFrame API:

pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()

I am getting the following error:

"Join condition is missing or trivial.

Use the CROSS JOIN syntax to allow cartesian products between these relations.;"

I have tried using the alias function, but without success:

val pos2 = pos.alias("P2")
pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()

This also leads to the same error.
Am I missing something about the usage of alias?

Now let's rename the columns:

val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()

It works!
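
One way to compare the variants is to look at their plans with explain
(a sketch; this is also what Gengliang suggests in the message quoted
below):

// the SQL version that works:
spark.sql("""SELECT POSITION.POSITION_ID FROM POSITION POSITION
  JOIN POSITION POSITION1
  ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID""").explain()

// the rename-based version that works:
pos.join(pos3, pos("POSITION_ID0") === pos3("POSITION_ID_2")).explain()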

There is one more really odd thing about all this: a colleague of mine
managed to get the same exception ("Join condition is missing or
trivial") with the original SQL query as well, but I think he was using
empty tables.

Thanks,
Michael


On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang
<ge...@databricks.com> wrote:
> Hi Michael,
>
> You can use `EXPLAIN` to see how your query is optimized:
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
> I believe your query is an actual cross join, which is usually very
> slow to execute.
>
> To get rid of this, you can set `spark.sql.crossJoin.enabled` to true.
>
>
> > On Jan 15, 2018, at 6:09 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
> Hi Michael,
>
> -dev +user
>
> What's the query? How do you "fool spark"?
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <ms...@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> If I try joining the table with itself using join columns, I am
>> getting the following error:
>> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
>> allow cartesian products between these relations.;"
>>
>> This is not true: my join is not trivial and is not a real cross
>> join. I am providing a join condition and expect to get maybe a couple
>> of joined rows for each row in the original table.
>>
>> There is a workaround for this, which involves renaming all the columns
>> in the source data frame and only afterwards proceeding with the join.
>> This allows us to fool Spark.
>>
>> Now I am wondering: is there a better way to get rid of this problem?
>> I do not like the idea of renaming the columns, because it makes it
>> really difficult to keep track of the column names in the result data
>> frames.
>> Is it possible to deactivate this check?
>>
>> Thanks,
>> Michael
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Inner join with the table itself

Posted by Gengliang Wang <ge...@databricks.com>.
Hi Michael,

You can use `EXPLAIN` to see how your query is optimized:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
I believe your query is an actual cross join, which is usually very slow
to execute.

To get rid of this, you can set `spark.sql.crossJoin.enabled` to true.
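
For example (a sketch; note that this disables the cartesian-product
check for the whole session, so use it with care):

// on an existing SparkSession:
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// or when building the session:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("self-join-example")  // hypothetical app name
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()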


> On Jan 15, 2018, at 6:09 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Hi Michael,
> 
> -dev +user
> 
> What's the query? How do you "fool spark"?
> 
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <mshtelma@gmail.com> wrote:
> Hi all,
> 
> If I try joining the table with itself using join columns, I am
> getting the following error:
> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
> allow cartesian products between these relations.;"
> 
> This is not true: my join is not trivial and is not a real cross
> join. I am providing a join condition and expect to get maybe a couple
> of joined rows for each row in the original table.
>
> There is a workaround for this, which involves renaming all the columns
> in the source data frame and only afterwards proceeding with the join.
> This allows us to fool Spark.
>
> Now I am wondering: is there a better way to get rid of this problem?
> I do not like the idea of renaming the columns, because it makes it
> really difficult to keep track of the column names in the result data
> frames.
> Is it possible to deactivate this check?
> 
> Thanks,
> Michael
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> 
>