Posted to user@hive.apache.org by Marc Limotte <ms...@gmail.com> on 2010/08/09 04:33:00 UTC

NullPointerException in GenericUDTFExplode.process()

Hi,

I think I may have run into a Hive bug.  And I'm not sure what's causing it
or how to work around it.

The reduce task log contains this exception:

java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:227)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode.process(GenericUDTFExplode.java:70)
    at org.apache.hadoop.hive.ql.exec.UDTFOperator.processOp(UDTFOperator.java:98)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:386)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:598)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:81)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:386)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:598)
    at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:46)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:386)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:598)
    at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:43)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:386)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:218)

This works fine for millions of rows of data, but the one row below causes
the whole job to fail.  Looking at the row, I don't see anything that
distinguishes it; if I knew what it was about the row that caused the
problem, I could filter it out beforehand.  I don't mind losing one row in
a million.

2010-08-05^A15^A^AUS^A1281022768^Af^A97^Aonline car insurance
quote^Aborderdisorder.com^A\N^A^A1076^B1216^B1480^B1481^B1493^B1496^B1497^B1504^B1509^B1686^B1724^B1729^B1819^B1829^B1906^B1995^B2018^B2025^B421^B426^B428^B433^B436^B449^B450^B452^B462^B508^B530^B-
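
To pull the suspect row out on its own for inspection, a predicate on a value visible in the dump (the stamp, in this case) should work; this is just a sketch, assuming the stamp is unique enough to isolate the row:

```sql
-- Select the suspect row by its timestamp so the array column can be
-- examined (and its size checked) in isolation.
SELECT receiver_code_list, size(receiver_code_list)
FROM tmp3
WHERE stamp = 1281022768;
```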

The source table and query are:

CREATE TABLE IF NOT EXISTS tmp3 (
  dt                  STRING,
  hr                  STRING,
  fld1                STRING,
  fld2                STRING,
  stamp               BIGINT,
  fld3                STRING,
  fld4                INT,
  rk                  STRING,
  rd                  STRING,
  rq                  STRING,
  kl                  ARRAY<STRING>,
  receiver_code_list  ARRAY<STRING>
)
ROW FORMAT DELIMITED
STORED AS SEQUENCEFILE;

-- The LIMIT 88 below ensures the one bad row is included; with LIMIT 87
-- it runs without failure.
SELECT count(1)
FROM (select receiver_code_list from tmp3 limit 88) tmp5
LATERAL VIEW explode(receiver_code_list) rcl AS receiver_code;
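
One workaround sketch (untested, and assuming the trigger is a NULL or empty array value reaching explode()) is to guard the subquery so such rows never reach the UDTF:

```sql
-- Hypothetical guard: drop rows whose array is NULL or empty before
-- exploding.  size() returns -1 for a NULL array in Hive, so the second
-- predicate covers both cases; the IS NOT NULL check is kept for clarity.
SELECT count(1)
FROM (
  SELECT receiver_code_list
  FROM tmp3
  WHERE receiver_code_list IS NOT NULL
    AND size(receiver_code_list) > 0
  LIMIT 88
) tmp5
LATERAL VIEW explode(receiver_code_list) rcl AS receiver_code;
```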

Any tips on what is wrong, or how else I might go about debugging it,
would be appreciated.  A way to have it skip rows that cause errors would
be an acceptable solution as well.

Thanks,
Marc

Re: NullPointerException in GenericUDTFExplode.process()

Posted by Marc Limotte <ms...@gmail.com>.
Hi Paul,

No nulls.  I ensure that every row has at least one entry (a hyphen) before
I split to create the list.

Marc

On Sun, Aug 8, 2010 at 8:14 PM, Paul Yang <py...@facebook.com> wrote:

> Seems like an issue that was patched already – can you check whether the
> column that you are calling explode() on has any null values?
RE: NullPointerException in GenericUDTFExplode.process()

Posted by Paul Yang <py...@facebook.com>.
Seems like an issue that was patched already - can you check whether the column that you are calling explode() on has any null values?
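
A check along these lines should answer that (sketch only; note that size() returns -1 when its argument is NULL in Hive, so both conditions below flag suspect rows):

```sql
-- Count rows where the array is NULL or deserializes as empty; either
-- case would be a candidate for tripping explode().
SELECT count(1)
FROM tmp3
WHERE receiver_code_list IS NULL
   OR size(receiver_code_list) <= 0;
```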


Re: NullPointerException in GenericUDTFExplode.process()

Posted by Marc Limotte <ms...@gmail.com>.
Also wanted to mention that I'm using the Cloudera distribution of Hive
(0.5.0+20-2) on CentOS.

Marc
