Posted to user@spark.apache.org by Sandeep Khurana <sa...@infoworks.io> on 2016/02/27 11:10:35 UTC

2 tables join happens at Hive but not in spark

Hello

We have two tables (tab1, tab2) exposed through Hive. Their data is in
different HDFS folders. We are trying to join the two tables on a single
column using the SparkR join function. Although the join columns hold the
same values, the join returns zero rows.

But when I run the same join as SQL from the Hive console and take
count(*), I get millions of records that meet the join criteria.

The join columns are of type 'int'. Also, when I separately join 'tab1'
(one of the two tables for which the join is not working) with a third
table, 'tab3', that join works.

To debug, we selected just one row from tab1 in the SparkR script, and one
row from tab2 with the same value in the join column. We used the SparkR
'select' function for this. Now the DataFrames for tab1 and tab2 have a
single row each, with the same value in the join column, but joining these
two single-row DataFrames still returned zero rows.
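
The single-row test described above might look roughly like the sketch
below. It is only an illustration under assumptions: it uses the Spark 1.x
SparkR API with a Hive-enabled context, the join column name "id" and the
value 42 are placeholders that do not come from this post, and the rows are
picked with filter and limit, which may differ from what the actual script
does.

```r
library(SparkR)

# Assumption: Spark 1.x SparkR, started with Hive support.
sc <- sparkR.init()
hiveContext <- sparkRHive.init(sc)

# tab1 and tab2 are the Hive tables from the post.
tab1 <- sql(hiveContext, "SELECT * FROM tab1")
tab2 <- sql(hiveContext, "SELECT * FROM tab2")

# "id" and 42L are hypothetical: substitute the real join column and a
# value known to exist in both tables.
one1 <- limit(filter(tab1, tab1$id == 42L), 1)
one2 <- limit(filter(tab2, tab2$id == 42L), 1)

# Inner join of the two single-row DataFrames, then count the result.
joined <- join(one1, one2, one1$id == one2$id, "inner")
count(joined)
```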


We are running the script from RStudio. It runs without any error, but the
join gives zero results, whereas in Hive I get many rows for the same join.
Any idea what might be the cause of this?



-- 
Architect
Infoworks.io
http://Infoworks.io

Re: 2 tables join happens at Hive but not in spark

Posted by Davies Liu <da...@databricks.com>.

What do the schemas of the two tables look like? Could you also show the
explain output for the query?
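
For example, something along these lines should print both schemas and the
query plan. This is a minimal sketch, assuming the Spark 1.x SparkR API
with a Hive-enabled context; the join column name "id" is a placeholder,
since the real column name is not given in the thread.

```r
library(SparkR)

# Assumption: Spark 1.x SparkR, started with Hive support.
sc <- sparkR.init()
hiveContext <- sparkRHive.init(sc)

tab1 <- sql(hiveContext, "SELECT * FROM tab1")
tab2 <- sql(hiveContext, "SELECT * FROM tab2")

# Column names and types as Spark sees them, to compare with Hive's view.
printSchema(tab1)
printSchema(tab2)

# "id" is hypothetical: replace it with the actual join column.
joined <- join(tab1, tab2, tab1$id == tab2$id, "inner")

# extended = TRUE prints the logical plans as well as the physical plan.
explain(joined, extended = TRUE)
```

Comparing the printSchema output for the join column on both sides is a
quick way to see whether Spark reads the two columns with the same type.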
