Posted to user@spark.apache.org by "boyingking@163.com" <bo...@163.com> on 2014/09/10 12:06:46 UTC

Spark SQL -- more than two tables for join

Hi:
I have a question about Spark SQL.

First, I use a left join on two tables, like this:
sql("SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey)").collect().foreach(println)  
The result is what I expect.
But when I use left joins on three or more tables, like this:
    sql("SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)").collect().foreach(println) 
I get this exception:
Exception in thread "main" java.lang.RuntimeException: [1.90] failure: ``UNION'' expected but `left' found

SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)
                                                                                         ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181)
at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140)
at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala)


My questions are:
1. Is my SQL script wrong, or does Spark SQL not support joining more than two tables?
2. If Spark SQL does not support more than two tables, how can I meet my requirement?






boyingking@163.com

Re: Spark SQL -- more than two tables for join

Posted by Matei Zaharia <ma...@gmail.com>.
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.
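For example, here is a minimal sketch of that switch (assumptions: sc is your existing SparkContext, the three tables have been registered on the HiveContext, and the query uses the HiveQL spelling LEFT OUTER JOIN; hql() is the HiveContext entry point in 1.0.x, while in 1.1 sql() on a HiveContext parses HiveQL by default):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// The same three-way left join, now handled by the HiveQL parser,
// which accepts chained JOIN clauses.
hiveContext.hql(
  "SELECT * FROM youhao_data " +
  "LEFT OUTER JOIN youhao_age ON (youhao_data.rowkey = youhao_age.rowkey) " +
  "LEFT OUTER JOIN youhao_totalKiloMeter ON (youhao_age.rowkey = youhao_totalKiloMeter.rowkey)"
).collect().foreach(println)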

On Oct 7, 2014, at 7:20 AM, Gen <ge...@gmail.com> wrote:

> Hi, in fact, the same problem happens when I try several joins together:
> 
> SELECT * 
> FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY 
> INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
> eans.FORM_KEY)
> 
> py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
> : java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
> `INNER' found
> 
> SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
> STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
> Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
> JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
> (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)
> 
>        at scala.sys.package$.error(package.scala:27)
>        at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
>        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
>        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>        at java.lang.reflect.Method.invoke(Method.java:606)
>        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>        at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>        at py4j.Gateway.invoke(Gateway.java:259)
>        at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>        at py4j.commands.CallCommand.execute(CallCommand.java:79)
>        at py4j.GatewayConnection.run(GatewayConnection.java:207)
>        at java.lang.Thread.run(Thread.java:745)
> 
> I use Spark 1.1.0, so I have the impression that Spark SQL doesn't support
> several joins together.
> 
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark SQL -- more than two tables for join

Posted by Gen <ge...@gmail.com>.
Hi, in fact, the same problem happens when I try several joins together:

SELECT * 
FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY 
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)

py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
: java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
`INNER' found

SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
(sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)

        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

I use Spark 1.1.0, so I have the impression that Spark SQL doesn't support
several joins together.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark SQL -- more than two tables for join

Posted by TANG Gen <ge...@keyrus.com>.
Hi, the same problem happens when I try several joins together, such as
'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY
INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
eans.FORM_KEY)'

The error information is as follows:
py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
: java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
`INNER' found

SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
(sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)
                                                                     ^
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)


I have the impression that Spark SQL doesn't support more than two joins.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15847.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Re: Spark SQL -- more than two tables for join

Posted by Yin Huai <hu...@gmail.com>.
1.0.1 does not have support for outer joins (added in 1.1). Can you try
the 1.1 branch?
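(For reference, a hedged sketch of the build change once 1.1 is published; the
artifact coordinates below are the standard ones and are assumed here:)

// build.sbt -- point the application at Spark 1.1 to pick up outer-join support
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-sql"  % "1.1.0"
)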

On Wed, Sep 10, 2014 at 9:28 PM, boyingking@163.com <bo...@163.com>
wrote:

>  Hi, Michael:
>
> I think Arthur.hk.chan <ar...@gmail.com> isn't here now, so I can
> show something:
> 1) My Spark version is 1.0.1.
> 2) When I use multiple joins, like this:
> sql("SELECT * FROM youhao_data left join youhao_age on
> (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on
> (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)")
>
>        youhao_data, youhao_age, and youhao_totalKiloMeter were registered with registerAsTable.
>
>  I get this exception:
>  Exception in thread "main" java.lang.RuntimeException: [1.90] failure:
> ``UNION'' expected but `left' found
>
> SELECT * FROM youhao_data left join youhao_age on
> (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on
> (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)
>
> ^
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
> at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181)
> at
> org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140)
> at
> org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala)
> ------------------------------
>  boyingking@163.com
>
>  *From:* Michael Armbrust <mi...@databricks.com>
> *Date:* 2014-09-11 00:28
> *To:* Arthur.hk.chan@gmail.com <ar...@gmail.com>
> *CC:* arunshell87 <sh...@gmail.com>; user@spark.incubator.apache.org
> *Subject:* Re: Spark SQL -- more than two tables for join
>    What version of Spark SQL are you running here?  I think a lot of your
> concerns have likely been addressed in more recent versions of the code /
> documentation.  (Spark 1.1 should be published in the next few days)
>
> In particular, for serious applications you should use a HiveContext and
> HiveQL as this is a much more complete implementation of a SQL Parser.  The
> one in SQL context is only suggested if the Hive dependencies conflict with
> your application.
>
>
>> 1)  Spark SQL does not support multiple joins
>>
>
> This is not true.  What problem were you running into?
>
>
>> 2)  Spark left join has performance issues
>>
>
> Can you describe your data and query more?
>
>
>> 3)  Spark SQL's cache table does not support two-tier queries
>>
>
> I'm not sure what you mean here.
>
>
>> 4)  Spark SQL does not support repartitioning
>
>
> You can repartition SchemaRDDs in the same way as normal RDDs.
>

Re: Re: Spark SQL -- more than two tables for join

Posted by "boyingking@163.com" <bo...@163.com>.
Hi, Michael:

I think Arthur.hk.chan isn't here now, so I can show something:
1) My Spark version is 1.0.1.
2) When I use multiple joins, like this:
sql("SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)") 
      
       youhao_data, youhao_age, and youhao_totalKiloMeter were registered with registerAsTable.

I get this exception:
Exception in thread "main" java.lang.RuntimeException: [1.90] failure: ``UNION'' expected but `left' found

SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)
                                                                                         ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181)
at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140)
at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala)



boyingking@163.com

From: Michael Armbrust
Date: 2014-09-11 00:28
To: Arthur.hk.chan@gmail.com
CC: arunshell87; user@spark.incubator.apache.org
Subject: Re: Spark SQL -- more than two tables for join
What version of Spark SQL are you running here?  I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation.  (Spark 1.1 should be published in the next few days)


In particular, for serious applications you should use a HiveContext and HiveQL as this is a much more complete implementation of a SQL Parser.  The one in SQL context is only suggested if the Hive dependencies conflict with your application.

1)  Spark SQL does not support multiple joins



This is not true.  What problem were you running into?

2)  Spark left join has performance issues



Can you describe your data and query more?

3)  Spark SQL's cache table does not support two-tier queries



I'm not sure what you mean here.

4)  Spark SQL does not support repartitioning


You can repartition SchemaRDDs in the same way as normal RDDs.

Re: Spark SQL -- more than two tables for join

Posted by Michael Armbrust <mi...@databricks.com>.
What version of Spark SQL are you running here?  I think a lot of your
concerns have likely been addressed in more recent versions of the code /
documentation.  (Spark 1.1 should be published in the next few days)

In particular, for serious applications you should use a HiveContext and
HiveQL as this is a much more complete implementation of a SQL Parser.  The
one in SQL context is only suggested if the Hive dependencies conflict with
your application.


> 1)  Spark SQL does not support multiple joins
>

This is not true.  What problem were you running into?


> 2)  Spark left join has performance issues
>

Can you describe your data and query more?


> 3)  Spark SQL's cache table does not support two-tier queries
>

I'm not sure what you mean here.


> 4)  Spark SQL does not support repartitioning


You can repartition SchemaRDDs in the same way as normal RDDs.
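For instance, a small sketch (people is a hypothetical registered table here; a
SchemaRDD is also an RDD[Row], so the standard RDD operations apply to it):

// Run a query, then repartition the result like any other RDD.
val people = sqlContext.sql("SELECT * FROM people")  // a SchemaRDD
val spread = people.repartition(16)                  // redistribute into 16 partitions
spread.count()                                       // forces the shuffle; sanity check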

Re: Spark SQL -- more than two tables for join

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi 

Some findings: 

1)  Spark SQL does not support multiple joins
2)  Spark left join has performance issues
3)  Spark SQL's cache table does not support two-tier queries
4)  Spark SQL does not support repartitioning

Arthur

On 10 Sep, 2014, at 10:22 pm, Arthur.hk.chan@gmail.com <ar...@gmail.com> wrote:

> Hi,
> 
> Maybe you can take a look at the following.
> 
> http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
> 
> Good luck.
> Arthur
> 
> On 10 Sep, 2014, at 9:09 pm, arunshell87 <sh...@gmail.com> wrote:
> 
>> 
>> Hi,
>> 
>> I too had tried SQL queries with joins, MINUS, subqueries, etc., but they did
>> not work in Spark SQL.
>> 
>> I did not find any documentation on what queries work and what do not work
>> in Spark SQL; maybe we have to wait for the Spark book to be released in
>> Feb-2015.
>> 
>> I believe you can try HiveQL in Spark for your requirement.
>> 
>> Thanks,
>> Arun
>> 
>> 
>> 
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark SQL -- more than two tables for join

Posted by "Arthur.hk.chan@gmail.com" <ar...@gmail.com>.
Hi,

Maybe you can take a look at the following.

http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html

Good luck.
Arthur

On 10 Sep, 2014, at 9:09 pm, arunshell87 <sh...@gmail.com> wrote:

> 
> Hi,
> 
> I too had tried SQL queries with joins, MINUS, subqueries, etc., but they did
> not work in Spark SQL.
> 
> I did not find any documentation on what queries work and what do not work
> in Spark SQL; maybe we have to wait for the Spark book to be released in
> Feb-2015.
> 
> I believe you can try HiveQL in Spark for your requirement.
> 
> Thanks,
> Arun
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark SQL -- more than two tables for join

Posted by arunshell87 <sh...@gmail.com>.
Hi,

I too had tried SQL queries with joins, MINUS, subqueries, etc., but they did
not work in Spark SQL.

I did not find any documentation on what queries work and what do not work
in Spark SQL; maybe we have to wait for the Spark book to be released in
Feb-2015.

I believe you can try HiveQL in Spark for your requirement.

Thanks,
Arun



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org