Posted to user@hive.apache.org by Yong Zhang <ja...@hotmail.com> on 2016/03/11 15:43:21 UTC

Hive 0.12 MAPJOIN hangs sometimes

Hi, Hive users:

Our Hadoop vendor currently ships Hive 0.12. I know it is an old version, but an upgrade is still a long way off.

Right now, we are facing an issue in Hive 0.12.

We have an ETL step implemented in Hive. Given the data volume in this step, we know that MAPJOIN is the right way to go, as one side of the data is very small while the other side is much larger.

Below is the query example:

set hive.exec.compress.output=true;
set parquet.compression=snappy;
set mapred.reduce.tasks=1;
set mapred.reduce.child.java.opts=-Xms1560m -Xmx4096m;
set mapred.task.timeout=7200000;
set mapred.map.tasks.speculative.execution=false;
set hive.ignore.mapjoin.hint=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

insert overwrite table a(dt='${hiveconf:run_date}', source='ip')
select
  /*+ MAPJOIN(trial_event) */
xxxx

The above query normally finishes in around 10 minutes each day, which we are very happy with. But sometimes the query hangs for hours in the ETL until we manually kill it.

I enabled debug logging in Hive and found the following messages:

2016-03-11 09:11:52 Starting to launch local task to process map join;  maximum memory = 536870912
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
16/03/11 09:11:55 DEBUG ipc.Client: IPC Client (-1284813870) connection to namenode/10.20.95.130:9000 from etl: closed
16/03/11 09:11:55 DEBUG ipc.Client: IPC Client (-1284813870) connection to namenode/10.20.95.130:9000 from etl: stopped, remaining connections 0

After that, nothing more is logged for hours.

If we don't use MAPJOIN, we don't hit this issue, but the query takes 2.5 hours.

When this happens, the NameNode works fine: I can run all kinds of HDFS operations without any issue while this query is hanging. What does "IPC Client remaining connections 0" mean? If we cannot upgrade our Hive version for now, is there any workaround?

Thanks
Yong

Re: Hive 0.12 MAPJOIN hangs sometimes

Posted by Jason Dere <jd...@hortonworks.com>.

A join between bigint and string might actually be evaluated by converting both values to double. Try running EXPLAIN on the query; that might show what conversion is being applied to the join keys.

If that is the case, you could try explicitly casting the join keys to either string or bigint in the join clause, to see whether the slow hash join on doubles that Prasanth mentioned is causing the slowness.
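
As a rough illustration of why the comparison type matters, the Python sketch below compares two made-up keys three ways: as doubles, as strings, and as integers. The key values and helper names are hypothetical, and Python floats stand in for Hive doubles only because both are IEEE 754 double precision. It says nothing about what Hive 0.12 actually plans for this query, which is what EXPLAIN would show.

```python
# Hypothetical join keys: one side is a bigint column, the other a string column.
bigint_key = 9007199254740993      # 2**53 + 1
string_key = "9007199254740992"    # 2**53, a different ID


def as_double(value):
    """Model an implicit conversion of a join key to double."""
    return float(value)


def as_string(value):
    """Model an explicit cast of a join key to string."""
    return str(value)


def as_bigint(value):
    """Model an explicit cast of a join key to bigint."""
    return int(value)


# Above 2**53 a double cannot represent every integer, so the two
# different IDs collapse to the same double value.
print("equal as double:", as_double(bigint_key) == as_double(string_key))

# With an explicit common type the keys stay exact and remain different.
print("equal as string:", as_string(bigint_key) == as_string(string_key))
print("equal as bigint:", as_bigint(bigint_key) == as_bigint(string_key))
```

Casting both sides to one exact type, as suggested above, keeps the key values intact and avoids relying on whatever implicit conversion the planner picks.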

________________________________
From: Yong Zhang <ja...@hotmail.com>
Sent: Friday, March 11, 2016 10:55 AM
To: user@hive.apache.org
Subject: RE: Hive 0.12 MAPJOIN hangs sometimes

No, the join columns are "bigint" and "string".

Yong

________________________________
Subject: RE: Hive 0.12 MAPJOIN hangs sometimes
From: pjayachandran@hortonworks.com
To: user@hive.apache.org; user@hive.apache.org
Date: Fri, 11 Mar 2016 15:48:52 +0000

Is the join column of type double? If so there is a known issue with DoubleWritable hash collisions that makes hash join insanely slow.

Thanks
Prasanth

On Fri, Mar 11, 2016 at 7:33 AM -0800, "Yong Zhang" <ja...@hotmail.com> wrote:

I understand the Hive version problem.

We are using IBM BigInsights V3.0.0.2, which comes with Hadoop 2.2.0 and Hive 0.12. It is extremely difficult to upgrade to BigInsights v4.x, as IBM made V4 totally different from V3. We are looking into options to upgrade, but it won't be quick.

The query and log are very big, so I attached them as a file.

Thanks

Yong

________________________________
From: jornfranke@gmail.com
Subject: Re: Hive 0.12 MAPJOIN hangs sometimes
Date: Fri, 11 Mar 2016 15:55:42 +0100
To: user@hive.apache.org

Honestly, 0.12 is a no-go: you miss a lot of performance improvements. Your query would probably execute in less than a minute. If your Hadoop vendor does not support smooth upgrades, then change vendors. Hive 1.2.1 is the absolute minimum, together with ORC or Parquet as the table format and Tez (preferred) or Spark as the execution engine.

To your questions:
It seems that the logger is configured incorrectly, which is why you may be missing some messages.

What is the exact join query? Older versions of Hive needed a special syntax if you wanted to benefit from partition pruning.

Which Hadoop version are you using?

RE: Hive 0.12 MAPJOIN hangs sometimes

Posted by Yong Zhang <ja...@hotmail.com>.

No, the join columns are "bigint" and "string".
Yong

RE: Hive 0.12 MAPJOIN hangs sometimes

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.

Is the join column of type double? If so there is a known issue with DoubleWritable hash collisions that makes hash join insanely slow.

Thanks
Prasanth
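
To make the collision idea concrete, here is a minimal Python sketch. It assumes a hash that keeps only the low 32 bits of the double's bit pattern, i.e. Java's `(int) Double.doubleToLongBits(value)`, which is how older Hadoop DoubleWritable versions are commonly reported to hash. That detail, the function names, and the sample keys are assumptions for illustration, not something stated in this thread, and the sketch does not show that this is the code path behind the hang. For comparison it also emulates Java's `Double.hashCode()`, which folds the high bits into the low bits.

```python
import struct
from collections import Counter


def _to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed Java int."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def truncating_double_hash(value: float) -> int:
    """Emulate `(int) Double.doubleToLongBits(value)`.

    Assumption: this mirrors the hash of older DoubleWritable versions.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]  # 64-bit pattern
    return _to_signed32(bits)                                # drop high 32 bits


def folding_double_hash(value: float) -> int:
    """Emulate Java's Double.hashCode(): XOR the high half into the low half."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return _to_signed32(bits ^ (bits >> 32))


# Hypothetical join keys: integer-valued IDs that were converted to double.
keys = [float(i) for i in range(1, 100_001)]

truncated = Counter(truncating_double_hash(k) for k in keys)
folded = Counter(folding_double_hash(k) for k in keys)

print("distinct hashes, truncating:", len(truncated))
print("distinct hashes, folding:   ", len(folded))
```

For small integer-valued doubles the low 32 bits of the bit pattern are all zero, so under the truncating hash every key lands in the same bucket and each hash table lookup degrades toward a linear scan. That is the general shape of the slowdown described above.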




RE: Hive 0.12 MAPJOIN hangs sometimes

Posted by Yong Zhang <ja...@hotmail.com>.

I understand the Hive version problem.
We are using IBM BigInsights V3.0.0.2, which comes with Hadoop 2.2.0 and Hive 0.12. It is extremely difficult to upgrade to BigInsights v4.x, as IBM made V4 totally different from V3. We are looking into options to upgrade, but it won't be quick.
The query and log are very big, so I attached them as a file.
Thanks
Yong

Re: Hive 0.12 MAPJOIN hangs sometimes

Posted by Jörn Franke <jo...@gmail.com>.

Honestly, 0.12 is a no-go: you miss a lot of performance improvements. Your query would probably execute in less than a minute. If your Hadoop vendor does not support smooth upgrades, then change vendors. Hive 1.2.1 is the absolute minimum, together with ORC or Parquet as the table format and Tez (preferred) or Spark as the execution engine.

To your questions:
It seems that the logger is configured incorrectly, which is why you may be missing some messages.

What is the exact join query? Older versions of Hive needed a special syntax if you wanted to benefit from partition pruning.

Which Hadoop version are you using?


