Posted to user@hive.apache.org by Adam Hunt <ad...@gmail.com> on 2016/01/04 17:58:32 UTC

NPE when reading Parquet using Hive on Tez

Hi,

When I perform any operation on a data set stored in Parquet format using
Hive on Tez, I get an NPE (see bottom for stack trace). The same operation
works fine on tables stored as text, Avro, ORC and Sequence files. The same
query on the parquet tables also works fine if I use Hive on MR.

I'm running MapR 5.0.0 with Hive 1.2.0-mapr-1510, Hadoop 2.7.0-mapr-1506
and Tez 0.7.0 compiled from source.

I'm currently evaluating Hive on Tez as an alternative to keeping the
SparkSQL Thrift Server running all the time, locking up resources.
Unfortunately, this is a blocker since most of our data is stored in
Parquet files.

Thanks,
Adam

select count(*) from alexa_parquet;
or
create table kmeans_results_100_orc stored as orc as select * from
kmeans_results_100;

], TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:337)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:192)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:131)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:97)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:614)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:593)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:141)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:370)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:127)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
    ... 14 more
Caused by: java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:252)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:189)
    ... 25 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:99)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:85)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:250)
    ... 26 more

Re: NPE when reading Parquet using Hive on Tez

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> I dug a little deeper and it appears that the configuration property
> "columns.types", which is used in
> org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(),
> is not being set. When I manually set that property in Hive, your
> example works fine.

Good to know more about the NPE. ORC uses the exact same parameter.

ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:
columnTypeProperty = conf.get(serdeConstants.LIST_COLUMN_TYPES);

But I think this could have a very simple explanation.

Assuming you have a build of Tez, I would recommend adding a couple of
LOG.warn lines in TezGroupedSplitsInputFormat

public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {


In particular, check whether the "this.conf" or the "job" conf object has
"columns.types" set.
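The diagnostic Gopal describes can be sketched without a Tez build. Below is a minimal, self-contained stand-in: plain HashMaps play the roles of the processor-level "this.conf" and the per-task JobConf (that substitution is an assumption of this sketch, not the Tez API).

```java
import java.util.HashMap;
import java.util.Map;

public class ConfDiagnostic {
    // Report which of two configuration objects carries a given key.
    static String locate(Map<String, String> tezConf, Map<String, String> jobConf, String key) {
        boolean inTez = tezConf.containsKey(key);
        boolean inJob = jobConf.containsKey(key);
        if (inTez && inJob) return "both";
        if (inTez) return "this.conf only";
        if (inJob) return "job only";
        return "neither";
    }

    public static void main(String[] args) {
        // Hypothesis from the thread: the compiler puts "columns.types" in the
        // processor-level conf, but Parquet reads it from the per-task JobConf.
        Map<String, String> tezConf = new HashMap<>();
        Map<String, String> jobConf = new HashMap<>();
        tezConf.put("columns.types", "int");
        System.out.println("columns.types found in: " + locate(tezConf, jobConf, "columns.types"));
    }
}
```

In the real getRecordReader, the equivalent LOG.warn lines would simply log the result of get("columns.types") on each conf object in scope, just before the record reader is created.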

My guess is that the "set" command puts it in the JobConf, while the
default compiler places it in the this.conf object.

If that is the case, we can fix Parquet to pick it up off the right one.
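If the hypothesis holds, the Parquet-side fix Gopal alludes to could amount to a fallback lookup. A hypothetical sketch (again with plain Maps standing in for Hadoop Configuration objects; this is not the actual Hive patch):

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnTypesFallback {
    static final String COLUMNS_TYPES = "columns.types";

    // Hypothetical fix: prefer the per-task JobConf, but fall back to the
    // processor-level conf when the compiler only placed the key there.
    static String columnTypes(Map<String, String> jobConf, Map<String, String> processorConf) {
        String v = jobConf.get(COLUMNS_TYPES);
        return (v != null) ? v : processorConf.get(COLUMNS_TYPES);
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();      // missing the key, as in the bug
        Map<String, String> processorConf = new HashMap<>();
        processorConf.put(COLUMNS_TYPES, "int");
        System.out.println(columnTypes(jobConf, processorConf));
    }
}
```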

Cheers,
Gopal
Re: NPE when reading Parquet using Hive on Tez

Posted by Adam Hunt <ad...@gmail.com>.
Hi Gopal,

With the release of Tez 0.8.2, I thought I would give Tez another shot.
Unfortunately, I got the same NPE. I dug a little deeper and it appears
that the configuration property "columns.types", which is used in
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(),
is not being set. When I manually set that property in Hive, your example
works fine.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set columns.types=int;
hive> select count(*) from x where x.x > 1;
OK
1

I also saw that the configuration parameter parquet.column.index.access is
checked in that same function. Setting that property to "true" also fixes
my issue.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set parquet.column.index.access=true;
hive> select count(*) from x where x.x > 1;
OK
1

Thanks for your help.

Best,
Adam




Re: NPE when reading Parquet using Hive on Tez

Posted by Adam Hunt <ad...@gmail.com>.
Hi Gopal,

Spark does offer dynamic allocation, but it doesn't always work as
advertised. My experience with Tez has been more in line with my
expectations. I'll bring up my issues with Spark on that list.

I tried your example and got the same NPE. It might be a mapr-hive issue.
Thanks for your help.

Adam


Re: NPE when reading Parquet using Hive on Tez

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> select count(*) from alexa_parquet;

> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
>     at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
>     at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
>     at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)

This might be an NPE triggered by a specific corner case of the type parser.

I tested it out on my current build with simple types and it looks like
the issue needs more detail on the column types for a repro.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> select count(*) from x where x.x > 1;
Status: DAG finished successfully in 0.18 seconds
OK
1
Time taken: 0.792 seconds, Fetched: 1 row(s)
hive> 

Do you have INT96 in the schema?

> I'm currently evaluating Hive on Tez as an alternative to keeping the
> SparkSQL Thrift Server running all the time, locking up resources.

Tez has a tunable value in tez.am.session.min.held-containers (i.e., set it
to something small like 10).

And HiveServer2 can be made to work similarly, because Spark's
HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.

Cheers,
Gopal