You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by willb <gi...@git.apache.org> on 2014/07/10 19:45:06 UTC

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

GitHub user willb opened a pull request:

    https://github.com/apache/spark/pull/1359

    SPARK-2407: Added internal implementation of SQL SUBSTR()

    This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
    and adds tests to verify correct operation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark internalSqlSubstring

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1359.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1359
    
----
commit 7d7b2631394d2b46f9753c0cf53e674b640eb82c
Author: William Benton <wi...@redhat.com>
Date:   2014-07-08T18:06:15Z

    Added internal implementation of SQL SUBSTR()
    
    This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
    and adds tests to verify correct operation.
    
    Squashes 8969d038 and e6419b4e.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48638949
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14802394
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (!resolved) {
    +      throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved")
    +    }
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  @inline
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    --- End diff --
    
    Yes, I think I can do it with implicit parameters.  I'll push an update once I've run the Catalyst suite against my change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48685587
  
    QA tests have started for PR 1359. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16544/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48651897
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48653871
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48987029
  
    QA tests have started for PR 1359. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16652/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48648864
  
    This sounds pretty cool!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-49155240
  
    Awesome! Please submit a pull request with that addition.
    On Jul 16, 2014 7:53 AM, "Teng Qiu" <no...@github.com> wrote:
    
    > hi, it is really very useful for us, i tried this implementation from
    > @willb <https://github.com/willb> , in spark-shell, i still got
    > java.lang.UnsupportedOperationException by Query Plan, i made some change
    > in SqlParser: chutium@1de83a7
    > <https://github.com/chutium/spark/commit/1de83a7560f85cd347bca6dde256d551da63a144>
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1359#issuecomment-49154812>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14782267
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
    @@ -860,6 +860,7 @@ private[hive] object HiveQl {
       val BETWEEN = "(?i)BETWEEN".r
       val WHEN = "(?i)WHEN".r
       val CASE = "(?i)CASE".r
    +  val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r
    --- End diff --
    
    Why "I_" here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48665810
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48639097
  
    QA tests have started for PR 1359. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14783468
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    --- End diff --
    
    Hive supports 0-based indexing in the same way as this patch.  I agree that supporting both in this way is ugly (both from an interface and from an implementation perspective), but it seems likely that people are depending on this behavior in the wild, doesn't it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-49093085
  
    I'm going to go ahead and merge this.  Someone can make a follow-up with nullability narrowing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-49093280
  
    Thanks!  Merged into master and 1.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14782682
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    +    // negative indices for start positions. If a start index i is greater than 0, it 
    +    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
    +    // to the -ith element before the end of the sequence. If a start index i is 0, it
    +    // refers to the first element.
    +    
    +    val start = startPos match {
    +      case pos if pos > 0 => pos - 1
    +      case neg if neg < 0 => len + neg
    +      case _ => 0
    +    }
    +    
    +    val end = sliceLen match {
    --- End diff --
    
    The MySQL doc mentions that "If len is less than 1, the result is the empty string."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48665812
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48971609
  
    Thanks!  Will merge as soon as Jenkins comes back happy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14782197
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    +    // negative indices for start positions. If a start index i is greater than 0, it 
    +    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
    +    // to the -ith element before the end of the sequence. If a start index i is 0, it
    +    // refers to the first element.
    +    
    +    val start = startPos match {
    +      case pos if pos > 0 => pos - 1
    +      case neg if neg < 0 => len + neg
    +      case _ => 0
    +    }
    +    
    +    val end = sliceLen match {
    +      case max if max == Integer.MAX_VALUE => max
    +      case x => start + x
    +    }
    +      
    +    str.slice(start, end)    
    +  }
    +  
    +  override def eval(input: Row): Any = {
    +    val string = str.eval(input)
    +
    +    val po = pos.eval(input)
    +    val ln = len.eval(input)
    +    
    +    if ((string == null) || (po == null) || (ln == null)) {
    +      null
    +    } else {
    +      val start = po.asInstanceOf[Int]
    +      val length = ln.asInstanceOf[Int] 
    +      
    +      string match {
    +        case ba: Array[Byte] => slice(ba, start, length)
    +        case other => slice(other.toString, start, length)
    +      }
    +    }
    +  }
    +  
    +  override def toString = s"SUBSTR($str, $pos, $len)"
    --- End diff --
    
    Makes sense!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48654045
  
    QA tests have started for PR 1359. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48690227
  
    QA results for PR 1359:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16544/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14782431
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
    @@ -860,6 +860,7 @@ private[hive] object HiveQl {
       val BETWEEN = "(?i)BETWEEN".r
       val WHEN = "(?i)WHEN".r
       val CASE = "(?i)CASE".r
    +  val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r
    --- End diff --
    
    Good catch -- the `I_` was spuriously committed.  (I had that in my working tree so that I could easily compare Catalyst and Hive `substr` from the console.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48638963
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48643276
  
    Thanks for the comments, @concretevitamin.  I've made the changes from your comments and will think about reducing branch overhead before pushing an update.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48665803
  
    QA results for PR 1359:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48969318
  
    Hey Will, this looks great.  Thanks for implementing!  Mind fixing the conflict with master so I can merge?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1359


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48685347
  
    QA results for PR 1359:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16543/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14783724
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    +    // negative indices for start positions. If a start index i is greater than 0, it 
    +    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
    +    // to the -ith element before the end of the sequence. If a start index i is 0, it
    +    // refers to the first element.
    +    
    +    val start = startPos match {
    +      case pos if pos > 0 => pos - 1
    +      case neg if neg < 0 => len + neg
    +      case _ => 0
    +    }
    +    
    +    val end = sliceLen match {
    --- End diff --
    
    Right, missed this before, sorry.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14782583
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    --- End diff --
    
    It looks like from [1] and [2] that Hive and SQL by default refer users to use 1-based indexing (and the links don't seem to mention 0-based at all). Even though they do support so, this subtle fact necessitates a couple branchings in the implementation which might cause us some performance penalty. What do you & people think about supporting 1-based indexing only? +@marmbrus @rxin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48651882
  
    QA results for PR 1359:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14781579
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -19,8 +19,11 @@ package org.apache.spark.sql.catalyst.expressions
     
     import java.util.regex.Pattern
     
    +import scala.collection.IndexedSeqOptimized
    +
     import org.apache.spark.sql.catalyst.types.DataType
     import org.apache.spark.sql.catalyst.types.StringType
    +import org.apache.spark.sql.catalyst.types.BinaryType
    --- End diff --
    
    alphabetization or combine the imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14781690
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    +    // negative indices for start positions. If a start index i is greater than 0, it 
    +    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
    +    // to the -ith element before the end of the sequence. If a start index i is 0, it
    +    // refers to the first element.
    +    
    +    val start = startPos match {
    +      case pos if pos > 0 => pos - 1
    +      case neg if neg < 0 => len + neg
    +      case _ => 0
    +    }
    +    
    +    val end = sliceLen match {
    +      case max if max == Integer.MAX_VALUE => max
    +      case x => start + x
    +    }
    +      
    +    str.slice(start, end)    
    +  }
    +  
    +  override def eval(input: Row): Any = {
    +    val string = str.eval(input)
    +
    +    val po = pos.eval(input)
    +    val ln = len.eval(input)
    +    
    +    if ((string == null) || (po == null) || (ln == null)) {
    +      null
    +    } else {
    +      val start = po.asInstanceOf[Int]
    +      val length = ln.asInstanceOf[Int] 
    +      
    +      string match {
    +        case ba: Array[Byte] => slice(ba, start, length)
    +        case other => slice(other.toString, start, length)
    +      }
    +    }
    +  }
    +  
    +  override def toString = s"SUBSTR($str, $pos, $len)"
    --- End diff --
    
    Minor: what do you think if we display only two args for the 2-arg case instead of displaying a Integer.MAX_VALUE?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by chutium <gi...@git.apache.org>.
Github user chutium commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-49154812
  
    hi, it is really very useful for us, i tried this implementation from @willb , in spark-shell, i still got java.lang.UnsupportedOperationException by Query Plan, i made some change in SqlParser: https://github.com/chutium/spark/commit/1de83a7560f85cd347bca6dde256d551da63a144



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by chutium <gi...@git.apache.org>.
Github user chutium commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-49159677
  
    PR submitted #1442 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48642397
  
    Hey @willb - thanks for working on this, which is going to be very useful for Spark SQL. I left a couple minor comments. Another general concern is the performance of `eval()`. If there are ways to reduce branchings, or reduce the function call (not sure if @inline will help), we should probably do it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48986821
  
    add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14787712
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    --- End diff --
    
    (Another point is that we need to branch on the index value even if we eliminate 0-based indexing in order to handle substrings taken from the end of the string.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14783626
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    +    val len = str.length
    +    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    +    // negative indices for start positions. If a start index i is greater than 0, it 
    +    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
    +    // to the -ith element before the end of the sequence. If a start index i is 0, it
    +    // refers to the first element.
    +    
    +    val start = startPos match {
    +      case pos if pos > 0 => pos - 1
    +      case neg if neg < 0 => len + neg
    +      case _ => 0
    +    }
    +    
    +    val end = sliceLen match {
    --- End diff --
    
    This is the behavior of `IndexedSeqOptimized[A,B].slice` (and thus this patch) as well as of Hive, too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14915562
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -205,3 +207,72 @@ case class EndsWith(left: Expression, right: Expression)
         extends BinaryExpression with StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    --- End diff --
    
    `nullable` could be `str.nullable || pos.nullable || len.nullable` ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48685301
  
    QA tests have started for PR 1359. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16543/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48971559
  
    Thanks, Michael.  I've fixed the conflict.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14801757
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (!resolved) {
    +      throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved")
    +    }
    +    if (str.dataType == BinaryType) str.dataType else StringType
    +  }
    +  
    +  def references = children.flatMap(_.references).toSet
    +  
    +  override def children = str :: pos :: len :: Nil
    +  
    +  @inline
    +  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    --- End diff --
    
    Would it be possible to do this without view bounds?  I think those are going to disappear soon: https://issues.scala-lang.org/browse/SI-7629


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48651895
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by willb <gi...@git.apache.org>.
Github user willb commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48648670
  
    @concretevitamin Inlining `Substring.slice` seems to make a nontrivial difference (~11.5% speedup) on a very simple `Substring.eval()` microbenchmark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48653881
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1359#issuecomment-48991417
  
    QA results for PR 1359:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds the following public classes (experimental):<br>case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16652/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1359#discussion_r14781542
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
    @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
     case class EndsWith(left: Expression, right: Expression) extends StringComparison {
       def compare(l: String, r: String) = l.endsWith(r)
     }
    +
    +/**
    + * A function that takes a substring of its first argument starting at a given position.
    + * Defined for String and Binary types.
    + */
    +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
    +  
    +  type EvaluatedType = Any
    +  
    +  def nullable: Boolean = true
    +  def dataType: DataType = {
    +    if (str.dataType == BinaryType) str.dataType else StringType
    --- End diff --
    
    Hey Will -- I think we need to check if `resolved` is true here to not break Catalyst's contract. Similar to [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L183).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---