Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2017/03/07 02:39:32 UTC

[jira] [Commented] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

    [ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898632#comment-15898632 ] 

Apache Spark commented on SPARK-19843:
--------------------------------------

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17184

> UTF8String => (int / long) conversion expensive for invalid inputs
> ------------------------------------------------------------------
>
>                 Key: SPARK-19843
>                 URL: https://issues.apache.org/jira/browse/SPARK-19843
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>
> For invalid inputs, converting a UTF8String to int or long returns null. This comes at a cost: the conversion method (e.g. [0]) throws an exception. Exception handling is expensive because it converts the UTF8String into a Java String and populates the stack trace (which is a native call). While migrating workloads from Hive to Spark, I see that in aggregate this hurts query performance compared with Hive.
> The exception only indicates that the conversion failed. It is not propagated to users, so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), which can be set to null if the conversion fails. This requires boxing, which is very bad for perf, so a big no.
> - Hive has a pre-check [1] for this, which is not a perfect safety net but helps catch typical bad inputs, e.g. the empty string or "null".
> [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
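A rough Java sketch of the two ideas in the description: a cheap pre-check that rejects obviously bad input, and a parse that reports failure through a boolean return value and writes the result into a reusable holder, so neither an exception nor boxing is involved. All names here (SafeLongParser, LongHolder, looksLikeInteger, tryParseLong) are hypothetical and are not taken from Spark or Hive. The pre-check follows only the idea of a pre-check and does not reproduce Hive's code. The accepted syntax (optional sign followed by ASCII digits) is an assumption and may differ from what UTF8String accepts.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the Spark or Hive implementation.
final class SafeLongParser {

    /** Mutable holder so a primitive result can be returned without boxing. */
    static final class LongHolder {
        long value;
    }

    /** Cheap pre-check: an optional sign followed by one or more ASCII digits. */
    static boolean looksLikeInteger(byte[] b) {
        if (b == null || b.length == 0) {
            return false;
        }
        int i = (b[0] == '-' || b[0] == '+') ? 1 : 0;
        if (i == b.length) {
            return false; // a sign with no digits
        }
        for (; i < b.length; i++) {
            if (b[i] < '0' || b[i] > '9') {
                return false;
            }
        }
        return true;
    }

    /**
     * Parses UTF-8 bytes as a long. Returns false on invalid input or
     * overflow instead of throwing; on success the value is stored in out.
     */
    static boolean tryParseLong(byte[] b, LongHolder out) {
        if (!looksLikeInteger(b)) {
            return false;
        }
        boolean negative = b[0] == '-';
        int i = (b[0] == '-' || b[0] == '+') ? 1 : 0;
        // Accumulate as a negative number so Long.MIN_VALUE is representable.
        long result = 0;
        final long limit = Long.MIN_VALUE / 10;
        for (; i < b.length; i++) {
            int digit = b[i] - '0';
            if (result < limit) {
                return false; // multiplying by 10 would overflow
            }
            result *= 10;
            if (result < Long.MIN_VALUE + digit) {
                return false; // subtracting the digit would overflow
            }
            result -= digit;
        }
        if (!negative) {
            if (result == Long.MIN_VALUE) {
                return false; // magnitude does not fit in a positive long
            }
            result = -result;
        }
        out.value = result;
        return true;
    }

    public static void main(String[] args) {
        LongHolder holder = new LongHolder(); // reused across calls
        for (String s : new String[] {"123", "-42", "", "null", "12abc"}) {
            boolean ok = tryParseLong(s.getBytes(StandardCharsets.UTF_8), holder);
            System.out.println("\"" + s + "\" -> " + (ok ? String.valueOf(holder.value) : "null"));
        }
    }
}
```

A caller would map a false return to SQL NULL. Because the holder is allocated once and reused, the failure path allocates nothing and never builds a stack trace.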


