Posted to dev@hive.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2016/11/07 06:26:58 UTC

[jira] [Created] (HIVE-15138) String + Integer gets converted to UDFToDouble causing number format exceptions

Rajesh Balamohan created HIVE-15138:
---------------------------------------

             Summary: String + Integer gets converted to UDFToDouble causing number format exceptions
                 Key: HIVE-15138
                 URL: https://issues.apache.org/jira/browse/HIVE-15138
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan
            Priority: Minor



TPCDS Query 72 has {{"d3.d_date > d1.d_date + 5"}}, where d_date contains values like {{2002-02-03, 2001-11-07}}. When this query runs, the compiler wraps the operands in UDFToDouble, which causes a large number of
{{NumberFormatExceptions}} while trying to convert the strings to doubles. An example stack trace is given below; filling in the stack for every row can be a significant performance hit, depending on the amount of data. A rewrite that avoids the cast is sketched after the stack trace.

{noformat}
"TezTaskRunner" #41340 daemon prio=5 os_prio=0 tid=0x00007f7914745000 nid=0x9725 runnable [0x00007f787ee4a000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Throwable.fillInStackTrace(Native Method)
        at java.lang.Throwable.fillInStackTrace(Throwable.java:783)
        - locked <0x00007f804b125ab0> (a java.lang.NumberFormatException)
        at java.lang.Throwable.<init>(Throwable.java:265)
        at java.lang.Exception.<init>(Exception.java:66)
        at java.lang.RuntimeException.<init>(RuntimeException.java:62)
        at java.lang.IllegalArgumentException.<init>(IllegalArgumentException.java:52)
        at java.lang.NumberFormatException.<init>(NumberFormatException.java:55)
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at org.apache.hadoop.hive.ql.udf.UDFToDouble.evaluate(UDFToDouble.java:172)
        at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:967)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:194)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:194)
        at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:121)
        at org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterDoubleColGreaterDoubleColumn.evaluate(FilterDoubleColGreaterDoubleColumn.java:51)
        at org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:110)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:144)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinGenerateResultOperator.forwardBigTableBatch(VectorMapJoinGenerateResultOperator.java:600)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinInnerLongOperator.process(VectorMapJoinInnerLongOperator.java:386)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinGenerateResultOperator.forwardBigTableBatch(VectorMapJoinGenerateResultOperator.java:600)
        at org.apache.hadoop.hive.ql.exec.vector.mapjoin.VectorMapJoinInnerLongOperator.process(VectorMapJoinInnerLongOperator.java:386)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:879)
{noformat}
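
For illustration (not part of the original report), the sketch below shows the effective semantics and a rewrite that stays off the double path. It assumes d_date is stored as a {{yyyy-MM-dd}} string, as in the TPCDS text schema:

{noformat}
-- Sketch only (assumes d_date is a 'yyyy-MM-dd' string, per the TPCDS text schema).
-- The addition below is planned as UDFToDouble(d_date) + 5.0; Double.parseDouble
-- throws a NumberFormatException for every row (the stack trace above), which
-- UDFToDouble catches, so the query succeeds but pays the stack-filling cost and
-- the expression typically evaluates to NULL.
select d_date + 5 from date_dim limit 1;

-- Rewriting with date_add() keeps the comparison on the date/string path and
-- avoids the per-row exception, e.g. for the TPCDS predicate:
--   ... where d3.d_date > date_add(d1.d_date, 5)
select date_add(d_date, 5) from date_dim limit 1;
{noformat}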


A simple query to reproduce this issue is given below. It would be helpful if Hive emitted an explicit WARN message so that the end user can add explicit casts to avoid such situations; a sketch of such a rewrite follows the plan.

{noformat}

Latest Hive (master): (Check UDFToDouble for d_date field)
====================

hive> explain select distinct d_date + 5 from date_dim limit 10;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20161107005816_1cc412bf-c19c-45c4-b468-236e4fc8ae09:8
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: (UDFToDouble(d_date) + 5.0) (type: double)
                    outputColumnNames: _col0
                    Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      keys: _col0 (type: double)
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: double)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: double)
                        Statistics: Num rows: 73049 Data size: 82034027 Basic stats: COMPLETE Column stats: NONE
                        TopN Hash Memory Usage: 0.04
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: double)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 36524 Data size: 41016452 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 10 Data size: 11230 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 10 Data size: 11230 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

{noformat}
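
For comparison (a sketch only, not output from this build), rewriting the expression with {{date_add}} or an explicit cast should keep UDFToDouble out of the Select Operator expressions shown above:

{noformat}
-- Sketch only: candidate rewrites to compare against the plan above.
-- date_add() accepts DATE values or 'yyyy-MM-dd' strings, so no string-to-double
-- conversion needs to be planned.
explain select distinct date_add(d_date, 5) from date_dim limit 10;

-- An explicit cast makes the intended type visible in the plan instead of a
-- silently inserted UDFToDouble:
explain select distinct date_add(cast(d_date as date), 5) from date_dim limit 10;
{noformat}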





