You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Rakesh Setty <se...@yahoo-inc.com> on 2009/06/30 01:01:52 UTC

Set difference in Hive

Hi,

            I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive?

select user from page_views where user not in (select name from users);

Thanks,
Rakesh

RE: Set difference in Hive

Posted by Rakesh Setty <se...@yahoo-inc.com>.
Thanks Raghu and Namit. This works.

Thanks,
Rakesh

-----Original Message-----
From: Raghu Murthy [mailto:rmurthy@facebook.com] 
Sent: Monday, June 29, 2009 5:22 PM
To: hive-user@hadoop.apache.org
Subject: Re: Set difference in Hive

Do you care about getting duplicate 'user' values in the result? If not, you
could just get rid of duplicates before doing the join.

select a.user from
  (select distinct user from page_views) a
  LEFT OUTER JOIN users b ON (a.user = b.user)
WHERE b.user is NULL

Note that the join runs out of memory in the reducer only if the number of
duplicates is large, not just when the tables are large.

On 6/29/09 5:11 PM, "Rakesh Setty" <se...@yahoo-inc.com> wrote:

> I tried this. Unfortunately, both tables are large.
>  
> Thanks,
> Rakesh
>  
> 
> 
> From: Namit Jain [mailto:njain@facebook.com]
> Sent: Monday, June 29, 2009 5:05 PM
> To: hive-user@hadoop.apache.org
> Subject: RE: Set difference in Hive
>  
> The tables can be large  -
>  
> For a given key,  have the table with the most number of values as the
> rightmost table.
>  
> The problem only happens when both the tables have keys with large number of
> values.
>  
>  
> Thanks,
> -namit
>  
>  
> 
> From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
> Sent: Monday, June 29, 2009 4:43 PM
> To: hive-user@hadoop.apache.org
> Subject: RE: Set difference in Hive
>  
> Thanks very much. But the reducer hangs with the warning WARN
> org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than
> joinEmitInterval rows for join key []
> Both the tables are large and as Zheng mentions at
> http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00640.html, large
> size for table 0 is a problem. Is there any way to overcome this?
>  
> Thanks,
> Rakesh
>  
> 
> 
> From: Peter Skomoroch [mailto:peter.skomoroch@gmail.com]
> Sent: Monday, June 29, 2009 4:20 PM
> To: hive-user@hadoop.apache.org
> Subject: Re: Set difference in Hive
>  
> Here is an example of what Amr mentioned from one of my Hive scripts, returns
> the set of pages not in "daily_pagecounts_table"
> 
> select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
> FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON
> (dt.page_id = dp.page_id)
> where dp.page_id is NULL
> 
> On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <aa...@cloudera.com> wrote:
> 
> 
> do an outer join on user and filter on name.user is null
> 
> -- amr
> 
> 
> 
> Rakesh Setty wrote:
> 
> Hi,
> 
>  
> 
>            I am new to Hive. I would like to know what is the easiest way to
> get the difference between two sets. For example, how can I convert the
> following SQL query to Hive?
> 
>  
> 
> select user from page_views where user not in (select name from users);
> 
>  
> 
> Thanks,
> 
> Rakesh
> 
> 
> 


RE: Set difference in Hive

Posted by Namit Jain <nj...@facebook.com>.
It is independent of the size of the table. The cardinality of values for a given key that matters.

Is that also pretty large for both tables ?



From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, June 29, 2009 5:11 PM
To: hive-user@hadoop.apache.org
Subject: RE: Set difference in Hive

I tried this. Unfortunately, both tables are large.

Thanks,
Rakesh

________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, June 29, 2009 5:05 PM
To: hive-user@hadoop.apache.org
Subject: RE: Set difference in Hive

The tables can be large  -

For a given key,  have the table with the most number of values as the rightmost table.

The problem only happens when both the tables have keys with large number of values.


Thanks,
-namit


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, June 29, 2009 4:43 PM
To: hive-user@hadoop.apache.org
Subject: RE: Set difference in Hive


Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key []

Both the tables are large and as Zheng mentions at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this?



Thanks,

Rakesh

________________________________
From: Peter Skomoroch [mailto:peter.skomoroch@gmail.com]
Sent: Monday, June 29, 2009 4:20 PM
To: hive-user@hadoop.apache.org
Subject: Re: Set difference in Hive

Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table"

select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id)
where dp.page_id is NULL
On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <aa...@cloudera.com>> wrote:

do an outer join on user and filter on name.user is null

-- amr


Rakesh Setty wrote:

Hi,



            I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive?



select user from page_views where user not in (select name from users);



Thanks,

Rakesh



--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Set difference in Hive

Posted by Raghu Murthy <rm...@facebook.com>.
Do you care about getting duplicate 'user' values in the result? If not, you
could just get rid of duplicates before doing the join.

select a.user from
  (select distinct user from page_views) a
  LEFT OUTER JOIN users b ON (a.user = b.user)
WHERE b.user is NULL

Note that the join runs out of memory in the reducer only if the number of
duplicates is large, not just when the tables are large.

On 6/29/09 5:11 PM, "Rakesh Setty" <se...@yahoo-inc.com> wrote:

> I tried this. Unfortunately, both tables are large.
>  
> Thanks,
> Rakesh
>  
> 
> 
> From: Namit Jain [mailto:njain@facebook.com]
> Sent: Monday, June 29, 2009 5:05 PM
> To: hive-user@hadoop.apache.org
> Subject: RE: Set difference in Hive
>  
> The tables can be large  -
>  
> For a given key,  have the table with the most number of values as the
> rightmost table.
>  
> The problem only happens when both the tables have keys with large number of
> values.
>  
>  
> Thanks,
> -namit
>  
>  
> 
> From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
> Sent: Monday, June 29, 2009 4:43 PM
> To: hive-user@hadoop.apache.org
> Subject: RE: Set difference in Hive
>  
> Thanks very much. But the reducer hangs with the warning WARN
> org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than
> joinEmitInterval rows for join key []
> Both the tables are large and as Zheng mentions at
> http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00640.html, large
> size for table 0 is a problem. Is there any way to overcome this?
>  
> Thanks,
> Rakesh
>  
> 
> 
> From: Peter Skomoroch [mailto:peter.skomoroch@gmail.com]
> Sent: Monday, June 29, 2009 4:20 PM
> To: hive-user@hadoop.apache.org
> Subject: Re: Set difference in Hive
>  
> Here is an example of what Amr mentioned from one of my Hive scripts, returns
> the set of pages not in "daily_pagecounts_table"
> 
> select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
> FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON
> (dt.page_id = dp.page_id)
> where dp.page_id is NULL
> 
> On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <aa...@cloudera.com> wrote:
> 
> 
> do an outer join on user and filter on name.user is null
> 
> -- amr
> 
> 
> 
> Rakesh Setty wrote:
> 
> Hi,
> 
>  
> 
>            I am new to Hive. I would like to know what is the easiest way to
> get the difference between two sets. For example, how can I convert the
> following SQL query to Hive?
> 
>  
> 
> select user from page_views where user not in (select name from users);
> 
>  
> 
> Thanks,
> 
> Rakesh
> 
> 
> 


RE: Set difference in Hive

Posted by Rakesh Setty <se...@yahoo-inc.com>.
I tried this. Unfortunately, both tables are large.

Thanks,
Rakesh

________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, June 29, 2009 5:05 PM
To: hive-user@hadoop.apache.org
Subject: RE: Set difference in Hive

The tables can be large  -

For a given key,  have the table with the most number of values as the rightmost table.

The problem only happens when both the tables have keys with large number of values.


Thanks,
-namit


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, June 29, 2009 4:43 PM
To: hive-user@hadoop.apache.org
Subject: RE: Set difference in Hive


Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key []

Both the tables are large and as Zheng mentions at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this?



Thanks,

Rakesh

________________________________
From: Peter Skomoroch [mailto:peter.skomoroch@gmail.com]
Sent: Monday, June 29, 2009 4:20 PM
To: hive-user@hadoop.apache.org
Subject: Re: Set difference in Hive

Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table"

select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id)
where dp.page_id is NULL
On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <aa...@cloudera.com>> wrote:

do an outer join on user and filter on name.user is null

-- amr


Rakesh Setty wrote:

Hi,



            I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive?



select user from page_views where user not in (select name from users);



Thanks,

Rakesh



--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

RE: Set difference in Hive

Posted by Namit Jain <nj...@facebook.com>.
The tables can be large  -

For a given key,  have the table with the most number of values as the rightmost table.

The problem only happens when both the tables have keys with large number of values.


Thanks,
-namit


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, June 29, 2009 4:43 PM
To: hive-user@hadoop.apache.org
Subject: RE: Set difference in Hive


Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key []

Both the tables are large and as Zheng mentions at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this?



Thanks,

Rakesh

________________________________
From: Peter Skomoroch [mailto:peter.skomoroch@gmail.com]
Sent: Monday, June 29, 2009 4:20 PM
To: hive-user@hadoop.apache.org
Subject: Re: Set difference in Hive

Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table"

select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id)
where dp.page_id is NULL
On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <aa...@cloudera.com>> wrote:

do an outer join on user and filter on name.user is null

-- amr


Rakesh Setty wrote:

Hi,



            I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive?



select user from page_views where user not in (select name from users);



Thanks,

Rakesh



--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

RE: Set difference in Hive

Posted by Rakesh Setty <se...@yahoo-inc.com>.
Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key []

Both the tables are large and as Zheng mentions at http://www.mail-archive.com/hive-user@hadoop.apache.org/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this?



Thanks,

Rakesh

________________________________
From: Peter Skomoroch [mailto:peter.skomoroch@gmail.com]
Sent: Monday, June 29, 2009 4:20 PM
To: hive-user@hadoop.apache.org
Subject: Re: Set difference in Hive

Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table"

select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id)
where dp.page_id is NULL
On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <aa...@cloudera.com>> wrote:

do an outer join on user and filter on name.user is null

-- amr


Rakesh Setty wrote:

Hi,



            I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive?



select user from page_views where user not in (select name from users);



Thanks,

Rakesh



--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Set difference in Hive

Posted by Peter Skomoroch <pe...@gmail.com>.
Here is an example of what Amr mentioned from one of my Hive scripts,
returns the set of pages not in "daily_pagecounts_table"

select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews
FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON
(dt.page_id = dp.page_id)
where dp.page_id is NULL

On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <aa...@cloudera.com> wrote:

>
> do an outer join on user and filter on name.user is null
>
> -- amr
>
>
> Rakesh Setty wrote:
>
>  Hi,
>
>
>
>             I am new to Hive. I would like to know what is the easiest way
> to get the difference between two sets. For example, how can I convert the
> following SQL query to Hive?
>
>
>
> select user from page_views where user not in (select name from users);
>
>
>
> Thanks,
>
> Rakesh
>
>


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Set difference in Hive

Posted by Amr Awadallah <aa...@cloudera.com>.
do an outer join on user and filter on name.user is null

-- amr

Rakesh Setty wrote:
>
> Hi,
>
>  
>
>             I am new to Hive. I would like to know what is the easiest 
> way to get the difference between two sets. For example, how can I 
> convert the following SQL query to Hive?
>
>  
>
> select user from page_views where user not in (select name from users);
>
>  
>
> Thanks,
>
> Rakesh
>