Posted to user@hive.apache.org by hadoop n00b <ne...@gmail.com> on 2011/03/16 11:33:19 UTC

Fwd: Hadoop error 2 while joining two large tables

Hello,

I am trying to execute a query that joins two large tables (3 million and 20
million records). I am getting Hadoop error code 2 during execution.
This happens mainly while the reducers are running; sometimes the reducers
reach 100% and then the error appears. The logs mention running out of
heap space and "GC overhead limit exceeded".

I am running a 6 node cluster with child JVM memory of 1GB.

Are there any parameters I could tweak to make the query run? Is adding more
nodes the solution to such a problem?

Thanks!

Re: Hadoop error 2 while joining two large tables

Posted by Edward Capriolo <ed...@gmail.com>.

First, make sure you are on a recent Hive; 0.6.0 is the latest release.
Next, always put the larger table on the right-hand side of the join.
Then try setting mapred.child.java.opts to a higher -Xmx than it has
now; sometimes 1024M is needed.
The third possibility is that the data is skewed, i.e. one key has
millions of rows while others have none. There is a Hive optimizer
variable for a skew join that might help.
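The three suggestions above can be sketched as Hive session settings. This is a sketch for a 0.6-era setup; property names and defaults vary by version, and the skew-join variables only exist from Hive 0.6 onward:

```sql
-- Give each child task JVM more heap (the value is illustrative; tune it
-- to what your nodes can actually spare).
SET mapred.child.java.opts=-Xmx1024m;

-- Enable the skew-join optimization: keys whose row count exceeds the
-- threshold are spilled and handled in a follow-up map-join job.
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;

-- Keep the larger table last in the JOIN; Hive streams the last table
-- and buffers the earlier ones in the reducer.
SELECT s.col, l.col
FROM smaller_table s
JOIN larger_table l ON (s.col = l.col);
```

Table and column names here are placeholders; only the SET lines are the actual tuning knobs.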

Re: Hadoop error 2 while joining two large tables

Posted by Edward Capriolo <ed...@gmail.com>.

In particular, I remember that back in the Hive 0.3 and 0.4 days many
troublesome issues like this were encountered. A lot of engineering went
into 0.5.0 to make joins work with less memory and avoid many
OutOfMemory conditions.

You can see some of my older struggles:

http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/201002.mbox/%3C8901CA2D-18EE-486C-B6D3-D1E8FA975C9C@facebook.com%3E

Upgrading is HIGHLY suggested. I remember that 0.4.0 -> 0.5.0 made problems
vanish. (That was a happy day.)

Re: Hadoop error 2 while joining two large tables

Posted by be...@yahoo.com.
Try out CDH3b4; it has Hive 0.7 and the latest of the other Hadoop tools. When you work with open source, it is definitely good practice to stay on recent versions: bugs are fewer, performance is better, and you get more functionality. Your query looks fine; an upgrade of Hive could sort things out.
Regards
Bejoy K S


Re: Hadoop error 2 while joining two large tables

Posted by Edward Capriolo <ed...@gmail.com>.
I am pretty sure the Cloudera distro has an upgrade path to a more recent Hive.


Re: Hadoop error 2 while joining two large tables

Posted by hadoop n00b <ne...@gmail.com>.
Hello All,

Thanks a lot for your response. To clarify a few points -

I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of
Hive, as we have to use the Cloudera distro only.

All records in the smaller table have at least one matching record in the
larger table (of course there could be a few exceptions, but only a few).

The join uses an ON clause. The query is something like:

select ...
from
(
  (select ... from smaller_table)
  join
  (select ... from larger_table)
  on (smaller_table.col = larger_table.col)
)

I will try setting mapred.child.java.opts to a higher -Xmx value and let
you know.

Is there a pattern or rule of thumb to follow on when to add more nodes?

Thanks again!
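A quick way to test the skew theory raised earlier in the thread against this query (table and column names are placeholders matching the sketch above):

```sql
-- Row count per join key in the larger table. If one key accounts for
-- millions of rows, a single reducer receives all of them and can run
-- out of heap regardless of cluster size.
SELECT larger_table.col, COUNT(1) AS cnt
FROM larger_table
GROUP BY larger_table.col
ORDER BY cnt DESC
LIMIT 20;
```

If the top keys dwarf the rest, the skew-join optimizer variable (or pre-filtering those keys) is likely to help more than extra heap or extra nodes.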

RE: Hadoop error 2 while joining two large tables

Posted by Steven Wong <sw...@netflix.com>.
In addition, put the smaller table on the left-hand side of a JOIN:

SELECT ... FROM small_table JOIN large_table ON ...




Re: Hadoop error 2 while joining two large tables

Posted by Bejoy Ks <be...@yahoo.com>.
Hey hadoop n00b
    I second Mark's thought, but you can definitely try reframing your
query to get things rolling. I'm not sure about your Hive query, but from my
experience with joins on huge tables (record counts in the range of hundreds
of millions), you should give the join conditions in a JOIN ... ON clause
rather than specifying all the conditions in WHERE.

Say you have a query like this:
SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b WHERE
a.Column4 = b.Column1 AND a.Column2 = b.Column4 AND a.Column3 > b.Column2;

You can definitely reframe it as:
SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b
ON (a.Column4 = b.Column1 AND a.Column2 = b.Column4) WHERE a.Column3 > b.Column2;

From my understanding, Hive supports only equijoins, so you can't have the
inequality conditions within JOIN ... ON; the inequality should go into
WHERE. This approach worked for me when I encountered a similar situation
some time ago. Try this out; hope it helps.

Regards
Bejoy.K.S







RE: Hadoop error 2 while joining two large tables

Posted by "Sunderlin, Mark" <ma...@teamaol.com>.
hadoop n00b asks, "Is adding more nodes the solution to such problem?"

Whatever other answers you get, you should append " ... and add more nodes." More nodes is never a bad thing ;-)

---
Mark E. Sunderlin
Solutions Architect |AOL Data Warehouse
P: 703-256-6935 | C: 540-327-6222
AIM: MESunderlin
22000 AOL Way | Dulles, VA | 20166



RE: Hadoop error 2 while joining two large tables

Posted by "Christopher, Pat" <pa...@hp.com>.
Are you using Hive on top of Hadoop or writing a raw Hadoop job?

This is the Hive list, so I'm going to assume you're running Hive... can you send your HiveQL query along?

Pat
