Posted to issues@spark.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2017/03/07 00:49:33 UTC

[jira] [Created] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

Tejas Patil created SPARK-19843:
-----------------------------------

             Summary: UTF8String => (int / long) conversion expensive for invalid inputs
                 Key: SPARK-19843
                 URL: https://issues.apache.org/jira/browse/SPARK-19843
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Tejas Patil


In case of invalid inputs, converting a UTF8String to an int or long returns null. This comes at a cost: the conversion method (e.g. [0]) throws an exception, and exception handling is expensive because it converts the UTF8String into a Java string and populates the stack trace (a native call). While migrating workloads from Hive to Spark, I see that, at an aggregate level, this hurts query performance in comparison with Hive.

The exception merely indicates that the conversion failed; it is not propagated to users, so it would be good to avoid it altogether.
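
To make the cost concrete, here is a minimal JDK-only sketch of the pattern (illustrative only, not Spark's actual code):

public final class ExceptionCostSketch {

    // The pattern described above: an invalid input constructs a
    // NumberFormatException, whose constructor calls the native
    // fillInStackTrace(), only for the exception to be swallowed and
    // null returned.
    static Integer toIntOrNull(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return null;   // the stack-trace capture cost was already paid
        }
    }

    public static void main(String[] args) {
        System.out.println(toIntOrNull("123"));    // 123
        System.out.println(toIntOrNull("junk"));   // null, but an exception was built
    }
}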

A couple of options:
- Return Integer / Long (instead of primitive types), which can be set to null if the conversion fails. This forces boxing, which is terrible for performance, so it's a non-starter.
- Hive has a pre-check [1] for this. It is not a perfect safety net, but it is helpful for catching typical bad inputs, e.g. an empty string or "null" (see the sketch after the links below).

[0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
[1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
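
For illustration, here is a rough Java sketch of such a pre-check over the raw bytes (hypothetical code, not Hive's actual implementation; the method name is made up):

public final class NumericPrecheck {

    // Hypothetical pre-check in the spirit of [1]: cheaply reject inputs
    // that obviously cannot be an integral number before attempting the
    // exception-throwing conversion. It is not a full validator -- e.g.
    // overflow still has to be caught by the real parser.
    static boolean mayBeLong(byte[] bytes, int offset, int length) {
        if (length == 0) {
            return false;                      // empty string never parses
        }
        int i = offset;
        int end = offset + length;
        if (bytes[i] == '+' || bytes[i] == '-') {
            if (++i == end) {
                return false;                  // a bare sign is invalid
            }
        }
        for (; i < end; i++) {
            if (bytes[i] < '0' || bytes[i] > '9') {
                return false;                  // any non-digit would throw
            }
        }
        return true;
    }
}

With a check like this in front of the conversion, common bad inputs (empty string, "null", free-form text) are rejected in a tight loop over the bytes and never reach the exception path; only rare cases such as overflow still pay for an exception.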


