You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2014/09/02 23:21:21 UTC
[jira] [Commented] (SPARK-3354) Add LENGTH and DATALENGTH functions to Spark SQL

    [ https://issues.apache.org/jira/browse/SPARK-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118771#comment-14118771 ] 

Nicholas Chammas commented on SPARK-3354:
-----------------------------------------

[~marmbrus] - This one's for your radar.

> Add LENGTH and DATALENGTH functions to Spark SQL
> ------------------------------------------------
>
>                 Key: SPARK-3354
>                 URL: https://issues.apache.org/jira/browse/SPARK-3354
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Nicholas Chammas
>
> It's very common when working on data sets of strings to want to know the lengths of the strings you are analyzing. Say I have some Tweets and want to find the average length of a Tweet by language.
> {code}
> SELECT language, AVG(LEN(tweet)) AS avg_length
> FROM tweets
> GROUP BY language
> ORDER BY avg_length DESC;
> {code}
> This is currently not possible because Spark SQL doesn't have a {{LEN()}} function.
> Another common function that would be useful is one that gives the size of the data item in bytes. This can be handy when moving data from Spark SQL to another system and you need to know how to size the receiving fields appropriately.
> *Proposal*
> * Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This function returns the number of characters in a string expression.
> * Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of {{DATALENGTH}}. This function returns the number of bytes in any expression.
> *Special care* must be given to the following cases:
> * multi-byte characters
> * {{NULL}}
> * trailing spaces
> *Examples*
> These are suggestions for how these 2 functions should work.
> {code}
> LEN('Hello') -> 5
> LEN('안녕') -> 2
> LEN('Hello 안녕') -> 8
> LEN(NULL) -> NULL
> LEN('') -> 0
> LEN('Bob  ') -> 3
> {code}
> In this last example with {{'Bob  '}}, trailing spaces are ignored. This matches the [behavior of SQL Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could opt to include the spaces.
> {code}
> DATALEN('Hello') -> 5
> DATALEN('안녕') -> 4
> DATALEN('Hello 안녕') -> 16
> DATALEN(NULL) -> NULL
> DATALEN('') -> 0
> DATALEN('Bob  ') -> 5
> {code}
> Note here how mixing English and Korean characters causes every character to be interpreted as a 2 byte wide character. Dunno if this sane; this may depend on Scala or JVM details that I wouldn't know about at the moment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org