Posted to dev@spark.apache.org by zsampson <zs...@palantir.com> on 2015/06/12 05:08:07 UTC

When to expect UTF8String?

I'm hoping for some clarity about when to expect String vs UTF8String when
using the Java DataFrames API.

In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was
once a String is now a UTF8String. The comments in the file and the related
commit message indicate that maybe it should be internal to SparkSQL's
implementation. 

However, when I add a column containing a custom subclass of Expression, the
row passed to the eval method contains instances of UTF8String. Ditto for
AggregateFunction.update. Is this expected? If so, how can I tell, in general,
when I'll need to deal with UTF8String objects?
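For illustration, a stripped-down version of one of these expressions looks
roughly like the following (the names are made up, and since these are
catalyst internals the exact traits and signatures may differ between Spark
versions):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.expressions.Expression
  import org.apache.spark.sql.types.{DataType, StringType, UTF8String}

  // Upper-cases a string column.
  case class UpperCaseExpr(child: Expression) extends Expression {
    override def children: Seq[Expression] = child :: Nil
    override def nullable: Boolean = child.nullable
    override def dataType: DataType = StringType

    override def eval(input: Row): Any = {
      val value = child.eval(input)
      if (value == null) {
        null
      } else {
        // Under 1.3 this value was a java.lang.String; under 1.4 it arrives
        // as a UTF8String, so the old cast to String now fails.
        val s = value.asInstanceOf[UTF8String].toString
        // Results for StringType also have to be UTF8String now.
        UTF8String(s.toUpperCase)
      }
    }
  }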





Re: When to expect UTF8String?

Posted by Michael Armbrust <mi...@databricks.com>.
>
> 1. Custom aggregators that do map-side combine.
>

This is something I'm hoping to add in Spark 1.5.


> 2. UDFs with more than 22 arguments, which ScalaUdf does not support, and to
> avoid wrapping a Java function interface in one of 22 different Scala
> function interfaces depending on the number of parameters.
>

I'm super open to suggestions here.  Mind possibly opening a JIRA with a
proposed interface?

RE: When to expect UTF8String?

Posted by Zack Sampson <zs...@palantir.com>.
We are using Expression for two things.

1. Custom aggregators that do map-side combine.

2. UDFs with more than 22 arguments, which ScalaUdf does not support, and to avoid wrapping a Java function interface in one of 22 different Scala function interfaces depending on the number of parameters.
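Concretely, the wrapping we're trying to avoid in #2 looks like this: our Java
side exposes one generic function interface, and driving it through ScalaUdf
would mean a hand-written adapter per arity, with nothing possible past 22
(sketch only; the interface and adapter names are ours):

  // One generic Java-facing function: a single call method taking all arguments.
  trait GenericJavaUDF extends java.io.Serializable {
    def call(args: Array[AnyRef]): AnyRef
  }

  // Feeding it to ScalaUdf would require one of these per arity...
  def asFunction2(f: GenericJavaUDF): (AnyRef, AnyRef) => AnyRef =
    (a, b) => f.call(Array(a, b))
  def asFunction3(f: GenericJavaUDF): (AnyRef, AnyRef, AnyRef) => AnyRef =
    (a, b, c) => f.call(Array(a, b, c))
  // ...and so on up to Function22, beyond which there is no Scala function
  // type to target at all.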

Are there methods we can use to convert to/from the internal representation in these cases?
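For strings specifically, is relying on something like this reasonable
(assuming the Spark 1.4 location of the class, and understanding that it's
internal)?

  import org.apache.spark.sql.types.UTF8String

  // external -> internal: wrap a java.lang.String before returning it from eval()
  val internal: UTF8String = UTF8String("some result")

  // internal -> external: unwrap a value handed to eval()/update()
  val external: String = internal.toString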
________________________________________
From: Michael Armbrust [michael@databricks.com]
Sent: Thursday, June 11, 2015 9:05 PM
To: Zack Sampson
Cc: dev@spark.apache.org
Subject: Re: When to expect UTF8String?

Through the DataFrame API, users should never see UTF8String.

Expression (and any class in the catalyst package) is considered internal and so uses the internal representation of various types.  Which type we use here is not stable across releases.

Is there a reason you aren't defining a UDF instead?

On Thu, Jun 11, 2015 at 8:08 PM, zsampson <zs...@palantir.com> wrote:
I'm hoping for some clarity about when to expect String vs UTF8String when
using the Java DataFrames API.

In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was
once a String is now a UTF8String. The comments in the file and the related
commit message indicate that maybe it should be internal to SparkSQL's
implementation.

However, when I add a column containing a custom subclass of Expression, the
row passed to the eval method contains instances of UTF8String. Ditto for
AggregateFunction.update. Is this expected? If so, how can I tell, in general,
when I'll need to deal with UTF8String objects?





Re: When to expect UTF8String?

Posted by Michael Armbrust <mi...@databricks.com>.
Through the DataFrame API, users should never see UTF8String.

Expression (and any class in the catalyst package) is considered internal
and so uses the internal representation of various types.  Which type we
use here is not stable across releases.

Is there a reason you aren't defining a UDF instead?
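For example, something like this stays entirely in external types, so you
never see UTF8String (df and sqlContext below are placeholders for your own
DataFrame and SQLContext):

  import org.apache.spark.sql.functions.{col, udf}

  // Written against plain java.lang.String; Spark converts to and from its
  // internal string representation on either side of the call.
  val upperCase = udf((s: String) => if (s == null) null else s.toUpperCase)

  val withUpper = df.withColumn("name_upper", upperCase(col("name")))

  // Or register it for use in SQL strings:
  sqlContext.udf.register("name_upper", (s: String) => s.toUpperCase)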

On Thu, Jun 11, 2015 at 8:08 PM, zsampson <zs...@palantir.com> wrote:

> I'm hoping for some clarity about when to expect String vs UTF8String when
> using the Java DataFrames API.
>
> In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was
> once a String is now a UTF8String. The comments in the file and the related
> commit message indicate that maybe it should be internal to SparkSQL's
> implementation.
>
> However, when I add a column containing a custom subclass of Expression,
> the
> row passed to the eval method contains instances of UTF8String. Ditto for
> AggregateFunction.update. Is this expected? If so, how can I tell, in general,
> when I'll need to deal with UTF8String objects?
>
>
>