Posted to user@hive.apache.org by Evan Pollan <Ev...@bazaarvoice.com> on 2012/02/21 21:56:07 UTC

Custom SerDe -- tracking down stack trace

I have a custom SerDe that's initializing properly and works on one data set.  I built it to adapt to a couple of different data formats, though, and it's choking on a different data set (different partitions in the same table).

A null pointer exception is being thrown on deserialize, that's being wrapped by an IOException somewhere up the stack.  The exception is showing up in the hive output ("Failed with exception java.io.IOException:java.lang.NullPointerException"), but I can't find the stack trace in any logs.

It's worth noting that I'm running hive via the cli on a machine external to the cluster, and the query doesn't get far enough to create any M/R tasks.  I looked in all log files in /var/log on the hive client machine, and in all userlogs on each cluster instance.  I also looked in derby.log (I'm using the embedded metastore) and in /var/lib/hive/metastore on the hive client machine.

I'm sure I'm missing something obvious…  Any ideas?

Re: Custom SerDe -- tracking down stack trace

Posted by Evan Pollan <Ev...@bazaarvoice.com>.
So, I tracked down the problem.  But, I'm curious as to why I got such different behavior when selecting directly from the partition vs. selecting from all partitions.

Context:  my custom deserializer was returning null when it encountered an unintelligible line (I saw this pattern in the contrib RegexSerDe and reused it).  This was apparently causing the LazySimpleSerDe.serialize() operation to NPE as the CLI driver was fetching the results when selecting directly from the partition with the bad line:

2012-02-21 22:55:14,166 ERROR CliDriver (SessionState.java:printError(365)) - Failed with exception java.io.IOException:java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1114)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:232)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: java.lang.NullPointerException
        at java.util.ArrayList.addAll(ArrayList.java:497)
        at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldsDataAsList(UnionStructObjectInspector.java:144)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:357)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:142)
        ... 9 more

However, when I queried across the entire data set (eliminating the partition predicate), the query returned without any errors.  Does the CLI behave differently based on the query plan?
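
For anyone hitting the same NPE, here is a minimal sketch of one possible workaround: hand back a row whose fields are all null for an unintelligible line, rather than returning null for the whole row. NullSafeRowParser and its regex-based parsing are hypothetical names made up for illustration, not Hive classes. In a real SerDe this logic would sit behind deserialize(), and whether an all-null row is the right answer for bad input depends on the data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hive): the row-building logic that a
// custom deserializer might delegate to.
class NullSafeRowParser {
    private final Pattern pattern;
    private final int numColumns;

    NullSafeRowParser(String regex, int numColumns) {
        this.pattern = Pattern.compile(regex);
        this.numColumns = numColumns;
    }

    // Always returns a list with exactly numColumns entries, never null.
    List<Object> parse(String line) {
        Matcher m = (line == null) ? null : pattern.matcher(line);
        if (m == null || !m.matches() || m.groupCount() < numColumns) {
            // Unintelligible line: emit an all-null row instead of a null row.
            return new ArrayList<Object>(Collections.nCopies(numColumns, (Object) null));
        }
        List<Object> row = new ArrayList<Object>(numColumns);
        for (int i = 1; i <= numColumns; i++) {
            row.add(m.group(i));
        }
        return row;
    }

    public static void main(String[] args) {
        // Assumed two-column format "<digits>,<word>", purely for the demo.
        NullSafeRowParser parser = new NullSafeRowParser("(\\d+),(\\w+)", 2);
        System.out.println(parser.parse("42,abc"));
        System.out.println(parser.parse("not a valid line"));
    }
}
```

Because the caller always gets a list of the expected size, code that copies the struct fields into another list should no longer be handed a null collection.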



From: Evan Pollan <ev...@bazaarvoice.com>
Reply-To: <us...@hive.apache.org>
Date: Wed, 22 Feb 2012 12:28:26 +0000
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Re: Custom SerDe -- tracking down stack trace

/tmp it is!  My bad — it was the one obvious place I omitted from my find/grep statement.  Thanks!

From: Matthew Byrd <mb...@acunu.com>
Reply-To: <us...@hive.apache.org>
Date: Wed, 22 Feb 2012 11:54:09 +0000
To: <us...@hive.apache.org>
Subject: Re: Custom SerDe -- tracking down stack trace

Hi Evan,
Did you look in your hive.log file?
Mine is found in /tmp/$USER/ ... usually where stack traces from the hive cli show up, if I'm not mistaken.
Have you also tried hooking up a debugger to hive? I'm guessing this is how you knew the null pointer was being thrown on deserialize?
What is actually null?
Matt


On Tue, Feb 21, 2012 at 11:01 PM, Evan Pollan <Ev...@bazaarvoice.com> wrote:
One more data point:  I can read data from this partition as long as I don't reference the partition explicitly…

E.g., my partition column is "ArrivalDate", and I have several different partitions:  "2012-02-01"…, and a partition with my test data with ArrivalDate="test".

This works:  'select * from table where <some constraint such that I only get results from the "test" partition>'.

And this works:  'select * from table where ArrivalDate="2012-02-01"'

But, this fails:  'select * from table where ArrivalDate="test"'

Does this make sense to anybody?



From: Evan Pollan <ev...@bazaarvoice.com>
Reply-To: <us...@hive.apache.org>
Date: Tue, 21 Feb 2012 20:56:07 +0000
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Custom SerDe -- tracking down stack trace

I have a custom SerDe that's initializing properly and works on one data set.  I built it to adapt to a couple of different data formats, though, and it's choking on a different data set (different partitions in the same table).

A null pointer exception is being thrown on deserialize, that's being wrapped by an IOException somewhere up the stack.  The exception is showing up in the hive output ("Failed with exception java.io.IOException:java.lang.NullPointerException"), but I can't find the stack trace in any logs.

It's worth noting that I'm running hive via the cli on a machine external to the cluster, and the query doesn't get far enough to create any M/R tasks.  I looked in all log files in /var/log on the hive client machine, and in all userlogs on each cluster instance.  I also looked in derby.log (I'm using the embedded metastore) and in /var/lib/hive/metastore on the hive client machine.

I'm sure I'm missing something obvious…  Any ideas?


Re: Custom SerDe -- tracking down stack trace

Posted by Evan Pollan <Ev...@bazaarvoice.com>.
/tmp it is!  My bad — it was the one obvious place I omitted from my find/grep statement.  Thanks!
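
In case it saves someone else the find/grep, here is a rough sketch of pulling the trace out of the CLI log. It assumes the log is at <java.io.tmpdir>/<user.name>/hive.log, which corresponds to the /tmp/$USER/ location mentioned in this thread on a typical Linux client. Your logging configuration may point somewhere else, so treat the path as an assumption. HiveLogScanner is a hypothetical helper written for this example, not something that ships with Hive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hive).
class HiveLogScanner {
    // Assumed default location: <java.io.tmpdir>/<user.name>/hive.log
    static Path defaultLogPath() {
        return Paths.get(System.getProperty("java.io.tmpdir"),
                         System.getProperty("user.name"),
                         "hive.log");
    }

    // Returns every line containing needle, plus up to context lines after it.
    static List<String> linesAfter(Path log, String needle, int context) throws IOException {
        List<String> out = new ArrayList<String>();
        int remaining = 0;
        for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
            if (line.contains(needle)) {
                remaining = context + 1; // the matching line plus its context
            }
            if (remaining > 0) {
                out.add(line);
                remaining--;
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Optional first argument overrides the assumed log path.
        Path log = args.length > 0 ? Paths.get(args[0]) : defaultLogPath();
        if (!Files.exists(log)) {
            System.err.println("No log file at " + log);
            return;
        }
        for (String line : linesAfter(log, "NullPointerException", 20)) {
            System.out.println(line);
        }
    }
}
```

Passing a path as the first argument is handy when the log lives outside the assumed directory.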


Re: Custom SerDe -- tracking down stack trace

Posted by Matthew Byrd <mb...@acunu.com>.
Hi Evan,
Did you look in your hive.log file?
Mine is found in /tmp/$USER/ ... usually where stack traces from the hive cli show up, if I'm not mistaken.
Have you also tried hooking up a debugger to hive? I'm guessing this is how you knew the null pointer was being thrown on deserialize?
What is actually null?
Matt


Re: Custom SerDe -- tracking down stack trace

Posted by Evan Pollan <Ev...@bazaarvoice.com>.
One more data point:  I can read data from this partition as long as I don't reference the partition explicitly…

E.g., my partition column is "ArrivalDate", and I have several different partitions:  "2012-02-01"…, and a partition with my test data with ArrivalDate="test".

This works:  'select * from table where <some constraint such that I only get results from the "test" partition>'.

And this works:  'select * from table where ArrivalDate="2012-02-01"'

But, this fails:  'select * from table where ArrivalDate="test"'

Does this make sense to anybody?


