Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:21:29 UTC

[jira] [Updated] (SPARK-11907) Allowing errors as values in DataFrames (like 'Either Left/Right')

     [ https://issues.apache.org/jira/browse/SPARK-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11907:
---------------------------------
    Labels: bulk-closed  (was: )

> Allowing errors as values in DataFrames (like 'Either Left/Right')
> ------------------------------------------------------------------
>
>                 Key: SPARK-11907
>                 URL: https://issues.apache.org/jira/browse/SPARK-11907
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>            Reporter: Tycho Grouwstra
>            Priority: Major
>              Labels: bulk-closed
>
> I like Spark, but one thing I find odd about it is how unforgiving it is about incidental errors in individual values. For one, given the following:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val rows = (1,"a") :: (2,"b") :: (3,"c") :: (0,"d") :: Nil
> val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("num","let")
> val div = udf[Double, Integer](10 / _)
> df.withColumn("div", div(col("num"))).show()
> {code}
> ... the whole job fails with a `java.lang.ArithmeticException: / by zero`, because of the single zero.
> The example is trivial, but my point stands: if one value goes wrong while the rest go right, why throw out the baby with the bathwater when you could report both what went wrong and what went right?
> Instead, I would propose allowing raised exceptions to be used as result values, not unlike how one might store 'bad' results using Either Left/Right constructions in Scala/Haskell (which I suppose would not currently work in DFs, as exceptions lack serializability), or cells containing errors in MS Excel.
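> The Either idea can be sketched outside Spark with plain Scala: wrapping each per-value computation in `Try` and converting to `Either` keeps each failure as data alongside the successful results. This is purely illustrative of the proposal, not an existing Spark API:
> {code}
> import scala.util.Try
>
> val nums = Seq(1, 2, 3, 0)
> // Each division either succeeds (Right) or captures the thrown
> // exception's message as a value (Left) -- nothing aborts the run.
> val results = nums.map(n => Try(10 / n).toEither.left.map(_.getMessage))
> // Seq(Right(10), Right(5), Right(3), Left("/ by zero"))
> {code}
> A DataFrame analogue would let the failing rows of a column carry such Left values instead of failing the whole job.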
> As a solution, I would propose a DataFrame subclass (?) using a variant of NullableColumnBuilder, e.g. ErrorableColumnBuilder (/ SafeColumnBuilder?).
> NullableColumnBuilder currently explains its workings as follows:
> {code}
> /**
>  * A stackable trait used for building byte buffer for a column containing null values.  Memory
>  * layout of the final byte buffer is:
>  * {{{
>  *    .------------------- Null count N (4 bytes)
>  *    |   .--------------- Null positions (4 x N bytes, empty if null count is zero)
>  *    |   |     .--------- Non-null elements
>  *    V   V     V
>  *   +---+-----+---------+
>  *   |   | ... | ... ... |
>  *   +---+-----+---------+
>  * }}}
>  */
> {code}
> This might be extended by adding a further section storing Throwables (or null) for the bad values in question (alternatively: store error counts/positions separately from the null ones, so that null values would not need a stored entry).
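> Purely as an illustration (this is not existing Spark code), the extended layout could mirror the existing null-tracking header with an analogous error section:
> {code}
>  *    .------------------- Null count N (4 bytes)
>  *    |   .--------------- Null positions (4 x N bytes)
>  *    |   |   .----------- Error count M (4 bytes)
>  *    |   |   |   .------- Error positions + serialized Throwables
>  *    |   |   |   |     .- Non-null, non-error elements
>  *    V   V   V   V     V
>  *   +---+---+---+-----+---------+
>  *   |   |...|   | ... | ... ... |
>  *   +---+---+---+-----+---------+
> {code}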
> Don't get me wrong, there is nothing wrong with throwing exceptions (or with catching them, for that matter). Rather, I see use cases for both the strict "do it right or bust" and the explorative "show me what happens if I try this operation on these values" -- not unlike how languages such as Ruby/Elixir distinguish unsafe methods with a bang ('!') from their safe variants that should not throw.
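> To make that distinction concrete, here is a small Scala sketch of the two styles (the helper names are hypothetical, chosen only for illustration):
> {code}
> import scala.util.Try
>
> // "Bang" style: fail fast, let the exception propagate to the caller
> def div(a: Int, b: Int): Int = a / b
>
> // "Safe" style: the failure becomes an ordinary value the caller inspects
> def divSafe(a: Int, b: Int): Either[Throwable, Int] = Try(a / b).toEither
> {code}
> `divSafe(10, 0)` returns a `Left` wrapping the ArithmeticException, while `div(10, 0)` throws; an error-tolerant DataFrame would behave like the safe variant per cell.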
> I'm sort of new here but would be glad to get some opinions on this idea.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org