Posted to user@hive.apache.org by William Slacum <ws...@gmail.com> on 2015/08/05 22:48:06 UTC

Hive on Tez much slower than MR

Hi all,

I'm using Hive 0.14, Tez 0.5.2, and Hadoop 2.6.0.

I have a very simple query of the form `select count(*) from my_table where
x > 0 and x < 1500`.

The table has ~50 columns in it and not all are populated. My total dataset
size is ~20TB. When I run with MapReduce, I can generally see a mapper pull
through ~100k records in a few seconds. The MR job, in total, takes about 2
minutes.

If all I do is set `hive.execution.engine=tez`, I end up getting a similar
number of Map tasks for Tez, but after 30 minutes or so they aren't
completed. I don't have much insight into what's going on.

I have confirmed the following:

1) Usually about 10 TezChild tasks are executed on a single node.
2) Each one is using greater than 100% CPU, but less than 150% CPU.
3) When I jstack a random task, it's usually in the middle of generating a
NumberFormatException. The stack trace is below, but it looks like when an
expected byte column is null or empty, LazyInteger#parseInt throws a
NumberFormatException and LazyByte#init swallows it and sets some default
value.
4) The worker logs a record count every time it reaches some power of 10.
The MR tasks rip through 100k+ records in a few seconds; Tez takes
5-10 minutes to reach 10,000 records.

My gut tells me that #3 is my issue (with #4 being a symptom), since in my
experience continual exception creation can be a performance killer.
However, I haven't been able to confirm that the logic for processing a row
is actually different between Tez and MR.
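The exception-cost hypothesis is easy to check outside of Hive. The following standalone micro-benchmark is my own sketch, not Hive code; `parseOrDefault` is a hypothetical stand-in for the LazyByte#init pattern of parsing a value and swallowing the NumberFormatException:

```java
// Standalone sketch of the swallow-an-exception-per-row pattern from the
// stack trace. parseOrDefault mimics LazyByte#init: attempt the parse, and on
// failure swallow the NumberFormatException and fall back to a default.
public class ExceptionCost {
    static int parseOrDefault(String s, int dflt) {
        try {
            return Integer.parseInt(s); // null/empty input throws NumberFormatException
        } catch (NumberFormatException e) {
            return dflt; // swallowed, but the exception and its stack trace were still built
        }
    }

    public static void main(String[] args) {
        final int n = 1_000_000;
        int sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += parseOrDefault("", 0);   // always throws
        long throwing = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += parseOrDefault("42", 0); // never throws
        long happy = System.nanoTime() - t0;

        System.out.printf("throwing path: %d ms, happy path: %d ms%n",
                throwing / 1_000_000, happy / 1_000_000);
        if (sum == Integer.MIN_VALUE) System.out.println(sum); // defeat dead-code elimination
    }
}
```

On a typical JVM the throwing path is dramatically slower, dominated by Throwable.fillInStackTrace, which is exactly the frame at the top of the jstack output below. Note, though, that MR and Tez run the same LazySimpleSerDe code here, so this would explain a slow scan but not by itself the MR/Tez gap.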

Anything I should check or tweak to get around this?

Here's the stacktrace:

Thread 6127: (state = IN_VM)

- java.lang.Throwable.fillInStackTrace(int) @bci=0 (Compiled frame;
information may be imprecise)

- java.lang.Throwable.fillInStackTrace() @bci=16, line=783 (Compiled frame)

- java.lang.Throwable.<init>(java.lang.String) @bci=24, line=265 (Compiled
frame)

- java.lang.Exception.<init>(java.lang.String) @bci=2, line=66 (Compiled
frame)

- java.lang.RuntimeException.<init>(java.lang.String) @bci=2, line=62
(Compiled frame)

- java.lang.IllegalArgumentException.<init>(java.lang.String) @bci=2,
line=53 (Compiled frame)

- java.lang.NumberFormatException.<init>(java.lang.String) @bci=2, line=55
(Compiled frame)

- org.apache.hadoop.hive.serde2.lazy.LazyInteger.parseInt(byte[], int, int,
int) @bci=62, line=104 (Compiled frame)

- org.apache.hadoop.hive.serde2.lazy.LazyByte.parseByte(byte[], int, int,
int) @bci=4, line=94 (Compiled frame)

-
org.apache.hadoop.hive.serde2.lazy.LazyByte.init(org.apache.hadoop.hive.serde2.lazy.ByteArrayRef,
int, int) @bci=15, line=52 (Compiled frame)

-
org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField()
@bci=101, line=111 (Compiled frame)

- org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(int)
@bci=6, line=172 (Compiled frame)

-
org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(java.lang.Object,
org.apache.hadoop.hive.serde2.objectinspector.StructField) @bci=60, line=67
(Compiled frame)

-
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(java.lang.Object)
@bci=53, line=394 (Compiled frame)

-
org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(org.apache.hadoop.io.Writable,
org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=16, line=137
(Compiled frame)

-
org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx,
org.apache.hadoop.io.Writable,
org.apache.hadoop.hive.ql.exec.mr.ExecMapperContext) @bci=3, line=100
(Compiled frame)

-
org.apache.hadoop.hive.ql.exec.MapOperator.process(org.apache.hadoop.io.Writable)
@bci=57, line=492 (Compiled frame)

-
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(java.lang.Object)
@bci=20, line=83 (Compiled frame)

- org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord() @bci=40,
line=68 (Compiled frame)

- org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run() @bci=9,
line=294 (Compiled frame)

-
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(java.util.Map,
java.util.Map) @bci=224, line=163 (Interpreted frame)
- org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(java.util.Map,
java.util.Map)

Re: Hive on Tez much slower than MR

Posted by William Slacum <ws...@gmail.com>.
Hey Jörn, thanks for the response! Unfortunately I'm stuck on the version
I'm on for now. We do plan on moving to ORC at some point.

I need to dig more into how vectorized execution is implemented. The
documentation (
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution)
mentions ORC, but I don't quite understand the requirement, unless all data
must be stored in ORC (even intermediate data between some map work and
some reduce work).

Thanks,
Bill

On Thu, Aug 6, 2015 at 2:05 AM, Jörn Franke <jo...@gmail.com> wrote:

> [snip]

Re: Hive on Tez much slower than MR

Posted by Jörn Franke <jo...@gmail.com>.
Always use the newest version of Hive. You should use ORC or Parquet
wherever possible. If you use ORC, you should explicitly enable storage
indexes and insert your table sorted (e.g. for this query you would sort
on x). Additionally, you should enable statistics.

Compression may bring additional performance gains. With ORC or Parquet the
files remain splittable regardless of the compression codec, since
compression is applied per block.
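A sketch of what this advice amounts to in HiveQL. Only `my_table` and `x` come from the thread; `my_table_orc` is my placeholder, and whether a plain CTAS fits the 20TB dataset is an open question:

```sql
-- Sketch: rewrite the table as ORC, sorted on x, so ORC's built-in min/max
-- stripe indexes can skip data for range predicates like x > 0 AND x < 1500.
-- SORT BY sorts within each reducer, which is enough for per-file indexes.
CREATE TABLE my_table_orc STORED AS ORC AS
SELECT * FROM my_table SORT BY x;

-- Gather table-level statistics for the optimizer.
ANALYZE TABLE my_table_orc COMPUTE STATISTICS;
```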

On Thu, Aug 6, 2015 at 8:11 AM, Bill Slacum <ws...@gmail.com> wrote:

> [snip]

Re: Hive on Tez much slower than MR

Posted by Bill Slacum <ws...@gmail.com>.
I was able to bring the performance in line with MR by enabling reduce-side vectorization, which apparently wasn't enabled in my cluster. The documentation regarding this is odd, as it says ORC is required, but none of my tables use ORC.
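For reference, a sketch of the switches involved (setting names as they exist in Hive 0.14; defaults vary by distribution, so check your cluster's hive-site.xml rather than taking these as the ones that were off):

```sql
SET hive.execution.engine=tez;
-- Map-side vectorization:
SET hive.vectorized.execution.enabled=true;
-- Reduce-side vectorization, the one that made the difference here:
SET hive.vectorized.execution.reduce.enabled=true;
```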



> On Aug 5, 2015, at 3:48 PM, William Slacum <ws...@gmail.com> wrote:
>
> [snip]