You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by phegstrom <gi...@git.apache.org> on 2018/08/24 20:55:12 UTC

[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

GitHub user phegstrom opened a pull request:

    https://github.com/apache/spark/pull/22227

    [SPARK-25202] [Core] Implements split with limit sql function

    ## What changes were proposed in this pull request?
    
    Adds support for the setting limit in the sql split function
    
    ## How was this patch tested?
    
    1. Updated unit tests
    2. Tested using Scala spark shell
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/phegstrom/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22227.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22227
    
----
commit 15362be4764d2b33757488efe38667cc762246ad
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-24T19:19:58Z

    implement split with limit

commit ceb3f41238c8731606164cea5c45a0b87bb5d6f2
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-24T20:26:42Z

    linting

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214561691
  
    --- Diff: sql/core/src/test/resources/sql-tests/inputs/string-functions.sql ---
    @@ -46,4 +46,10 @@ FROM (
         encode(string(id + 2), 'utf-8') col3,
         encode(string(id + 3), 'utf-8') col4
       FROM range(10)
    -)
    +);
    +
    +-- split function
    +select split('aa1cc2ee3', '[1-9]+');
    --- End diff --
    
    I would use upper cases here for keywords.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96112/testReport)** for PR 22227 at commit [`5c8f487`](https://github.com/apache/spark/commit/5c8f48715748bdeda703761fba6a4d1828a19985).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95476/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96015/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96826/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    hmm @HyukjinKwon any ideas what's happening?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95563/testReport)** for PR 22227 at commit [`d80b1a1`](https://github.com/apache/spark/commit/d80b1a15ed8941bad78df2c5f7168a4196d27be4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217563366
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1671,18 +1671,32 @@ def repeat(col, n):
     
     @since(1.5)
     @ignore_unicode_prefix
    -def split(str, pattern):
    +def split(str, pattern, limit=-1):
         """
    -    Splits str around pattern (pattern is a regular expression).
    +    Splits str around matches of the given pattern.
     
    -    .. note:: pattern is a string represent the regular expression.
    +    :param str: a string expression to split
    +    :param pattern: a string representing a regular expression. The regex string should be
    +                  a Java regular expression.
    --- End diff --
    
    Shall we make it four-spaced.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    LGTM except for the R/python parts (I'm not familiar with these parts and I'll leave them to @felixcheung and @HyukjinKwon).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Seems fine otherwise.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213063991
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
            ["one","two","three",""]
    +|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    + |       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, pattern: Expression, limit: Expression)
    --- End diff --
    
    ^ ignore this! found it @maropu 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    You need to handle that in both codegen(doGenCode) and interpreter(nullSafeEval) path. Also, can you add tests to check if they have the same behaviour in the limit=0/limit=-1 cases? You'd be better to explicitly put comments in a proper position about the decision.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214497013
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,74 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
    +       ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    +       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, regex: Expression, limit: Expression)
    +  extends TernaryExpression with ImplicitCastInputTypes {
     
    -  override def left: Expression = str
    -  override def right: Expression = pattern
       override def dataType: DataType = ArrayType(StringType)
    -  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
    +  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
    +  override def children: Seq[Expression] = str :: regex :: limit :: Nil
    +
    +  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
     
    -  override def nullSafeEval(string: Any, regex: Any): Any = {
    -    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
    +  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
    +    val strings = string.asInstanceOf[UTF8String].split(
    +      regex.asInstanceOf[UTF8String], maybeFallbackLimitValue(limit.asInstanceOf[Int]))
         new GenericArrayData(strings.asInstanceOf[Array[Any]])
       }
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         val arrayClass = classOf[GenericArrayData].getName
    -    nullSafeCodeGen(ctx, ev, (str, pattern) =>
    +    nullSafeCodeGen(ctx, ev, (str, regex, limit) => {
           // Array in java is covariant, so we don't need to cast UTF8String[] to Object[].
    -      s"""${ev.value} = new $arrayClass($str.split($pattern, -1));""")
    +      s"""${ev.value} = new $arrayClass($str.split(
    +         $regex,${handleCodeGenLimitFallback(limit)}));""".stripMargin
    +    })
       }
     
       override def prettyName: String = "split"
    +
    +  /**
    +   * Java String's split method supports "ignore empty string" behavior when the limit is 0.
    +   * To avoid this, we fall back to -1 when the limit is 0. Otherwise, this is a noop.
    +   */
    +  def maybeFallbackLimitValue(limit: Int): Int = {
    --- End diff --
    
    +1, and please add `limit = 0` case in `UTF8StringSuite`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95494/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216941249
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,51 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0
    +   *            <ul>
    +   *              <li>
    +   *                The resulting array's length will not be more than limit,
    +   *                and the resulting array's last entry will contain all input
    +   *                beyond the last matched regex.
    +   *             </li>
    +   *            </ul>
    +   *          </li>
    +   *          <li>limit less than or equal to 0
    +   *            <ul>
    +   *              <li>
    +   *                `regex` will be applied as many times as possible,
    +   *                and the resulting array can be of any size.
    +   *              </li>
    +   *            </ul>
    +   *          </li>
    +   *        </ul>
    --- End diff --
    
    I think you can just:
    
    ```
       *        <ul>
       *          <li>limit greater than 0: The resulting array's length will not be more than limit,
       *          and the resulting array's last entry will contain all input
       *          beyond the last matched regex.</li>
       *          <li>limit less than or equal to 0: `regex` will be applied as many times as possible,
       *          and the resulting array can be of any size.</li>
       *        </ul>
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214245829
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,37 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *    limit greater than 0: The resulting array's length will not be more than `limit`,
    +   *                          and the resulting array's last entry will contain all input beyond
    +   *                          the last matched regex.
    +   *    limit less than or equal to 0: `regex` will be applied as many times as possible, and
    +   *                       the resulting array can be of any size.
    --- End diff --
    
    Indentation here looks a bit odd and looks inconsistent at least. Can you double check Scaladoc and format this correctly?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217563726
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,53 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array with a length of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +          * limit > 0: The resulting array's length will not be more than `limit`,
    +                     and the resulting array's last entry will contain all input
    +                     beyond the last matched regex.
    --- End diff --
    
    indentation:
    
    ```
    * limit > 0: The resulting array's length will not be more than `limit`,
      and the resulting array's last entry will contain all input
      beyond the last matched regex.
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96966/testReport)** for PR 22227 at commit [`34ba74f`](https://github.com/apache/spark/commit/34ba74f79aad2a0e2fe9e0d6f6110a10a51c8108).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562121
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,33 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    +def split(str, pattern, limit=-1):
         """
    -    Splits str around pattern (pattern is a regular expression).
    +    Splits str around matches of the given pattern.
    +
    +    :param str: a string expression to split
    +    :param pattern: a string representing a regular expression. The regex string should be
    +                  a Java regular expression.
    +    :param limit: an integer expression which controls the number of times the pattern is applied.
     
    -    .. note:: pattern is a string represent the regular expression.
    +            * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    +                             resulting array's last entry will contain all input beyond the last
    +                             matched pattern.
    +            * ``limit <= 0``: `pattern` will be applied as many times as possible, and the resulting
    +                              array can be of any size.
    --- End diff --
    
    Indentation:
    
    ```diff
    -            * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    -                             resulting array's last entry will contain all input beyond the last
    -                             matched pattern.
    -            * ``limit <= 0``: `pattern` will be applied as many times as possible, and the resulting
    -                              array can be of any size.
    +        * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    +          resulting array's last entry will contain all input beyond the last
    +          matched pattern.
    +        * ``limit <= 0``: `pattern` will be applied as many times as possible, and the resulting
    +          array can be of any size.
    ```
    
    Did you check the HTML output?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r221647616
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0: The resulting array's length will not be more than limit,
    +   *          and the resulting array's last entry will contain all input beyond the last
    +   *          matched regex.</li>
    +   *          <li>limit less than or equal to 0: `regex` will be applied as many times as
    +   *          possible, and the resulting array can be of any size.</li>
    --- End diff --
    
    ah, I'll look into that


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r221645288
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0: The resulting array's length will not be more than limit,
    +   *          and the resulting array's last entry will contain all input beyond the last
    +   *          matched regex.</li>
    +   *          <li>limit less than or equal to 0: `regex` will be applied as many times as
    +   *          possible, and the resulting array can be of any size.</li>
    --- End diff --
    
    I mean we may not need ending tag `</li>`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95486/testReport)** for PR 22227 at commit [`79599eb`](https://github.com/apache/spark/commit/79599ebc26f089737e101b417428ceb6620f802d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95987/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96015/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r219714851
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,27 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the length of the returned array.
    +#'              \itemize{
    +#'              \item \code{limit > 0}: length of the array will be at most \code{limit}
    +#'              \item \code{limit <= 0}: the returned array can have any length
    +#'              }
    +#'
     #' @aliases split_string split_string,Column-method
     #' @examples
     #'
     #' \dontrun{
     #' head(select(df, split_string(df$Sex, "a")))
     #' head(select(df, split_string(df$Class, "\\d")))
    +#' head(select(df, split_string(df$Class, "\\d", 2)))
     #' # This is equivalent to the following SQL expression
     #' head(selectExpr(df, "split(Class, '\\\\d')"))}
    --- End diff --
    
    good point - also the example should run in the order documented.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95774/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Also, you need to update split in python and R.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom closed the pull request at:

    https://github.com/apache/spark/pull/22227


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r215764191
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,33 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    +def split(str, pattern, limit=-1):
         """
    -    Splits str around pattern (pattern is a regular expression).
    +    Splits str around matches of the given pattern.
    +
    +    :param str: a string expression to split
    +    :param pattern: a string representing a regular expression. The regex string should be
    +                  a Java regular expression.
    +    :param limit: an integer expression which controls the number of times the pattern is applied.
     
    -    .. note:: pattern is a string represent the regular expression.
    +            * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    +                             resulting array's last entry will contain all input beyond the last
    +                             matched pattern.
    +            * ``limit <= 0``: `pattern` will be applied as many times as possible, and the resulting
    +                              array can be of any size.
    --- End diff --
    
    I did, see attached! Let me know what you think (unsure why initial description of limit starts on a new line):
    ![screen shot 2018-09-06 at 4 18 48 pm](https://user-images.githubusercontent.com/5938022/45182990-ce86ac80-b1f0-11e8-8cae-721d9936526c.png)



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95774 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95774/testReport)** for PR 22227 at commit [`64b0afc`](https://github.com/apache/spark/commit/64b0afca802c4557b5a53aa62b7486c3d8d4fe8c).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212782784
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    --- End diff --
    
    Can you refine the description and the format along with the others, e.g., `RLike`
    https://github.com/apache/spark/blob/ceb3f41238c8731606164cea5c45a0b87bb5d6f2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L78


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r221662234
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0: The resulting array's length will not be more than limit,
    +   *          and the resulting array's last entry will contain all input beyond the last
    +   *          matched regex.</li>
    +   *          <li>limit less than or equal to 0: `regex` will be applied as many times as
    +   *          possible, and the resulting array can be of any size.</li>
    --- End diff --
    
    Ok. Then it's fine. Thanks for looking at it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217930978
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1803,6 +1803,18 @@ test_that("string operators", {
         collect(select(df4, split_string(df4$a, "\\\\")))[1, 1],
         list(list("a.b@c.d   1", "b"))
       )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
    +    list(list("a", "b@c.d   1\\b"))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", -2)))[1, 1],
    +    list(list("a.", "@c.d   1\\", ""))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", 0)))[1, 1],
    --- End diff --
    
    per @felixcheung's I added back the `limit = 0` case


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95578/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214516644
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,74 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
    +       ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    +       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, regex: Expression, limit: Expression)
    +  extends TernaryExpression with ImplicitCastInputTypes {
     
    -  override def left: Expression = str
    -  override def right: Expression = pattern
       override def dataType: DataType = ArrayType(StringType)
    -  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
    +  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
    +  override def children: Seq[Expression] = str :: regex :: limit :: Nil
    +
    +  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
     
    -  override def nullSafeEval(string: Any, regex: Any): Any = {
    -    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
    +  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
    +    val strings = string.asInstanceOf[UTF8String].split(
    +      regex.asInstanceOf[UTF8String], maybeFallbackLimitValue(limit.asInstanceOf[Int]))
         new GenericArrayData(strings.asInstanceOf[Array[Any]])
       }
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         val arrayClass = classOf[GenericArrayData].getName
    -    nullSafeCodeGen(ctx, ev, (str, pattern) =>
    +    nullSafeCodeGen(ctx, ev, (str, regex, limit) => {
           // Array in java is covariant, so we don't need to cast UTF8String[] to Object[].
    -      s"""${ev.value} = new $arrayClass($str.split($pattern, -1));""")
    +      s"""${ev.value} = new $arrayClass($str.split(
    +         $regex,${handleCodeGenLimitFallback(limit)}));""".stripMargin
    +    })
       }
     
       override def prettyName: String = "split"
    +
    +  /**
    +   * Java String's split method supports "ignore empty string" behavior when the limit is 0.
    +   * To avoid this, we fall back to -1 when the limit is 0. Otherwise, this is a noop.
    +   */
    +  def maybeFallbackLimitValue(limit: Int): Int = {
    --- End diff --
    
    done!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r221652521
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0: The resulting array's length will not be more than limit,
    +   *          and the resulting array's last entry will contain all input beyond the last
    +   *          matched regex.</li>
    +   *          <li>limit less than or equal to 0: `regex` will be applied as many times as
    +   *          possible, and the resulting array can be of any size.</li>
    --- End diff --
    
    @viirya throughout this repository, the `</li>` has always been included. For consistency, I think we should just keep it as is


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    not `[CORE]` but `[SQL]` in the title.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95907/testReport)** for PR 22227 at commit [`b12ee88`](https://github.com/apache/spark/commit/b12ee881c1d9025644d075a6124f9e6e465a6378).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96826/testReport)** for PR 22227 at commit [`34ba74f`](https://github.com/apache/spark/commit/34ba74f79aad2a0e2fe9e0d6f6110a10a51c8108).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r219691017
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0: The resulting array's length will not be more than limit,
    +   *          and the resulting array's last entry will contain all input beyond the last
    +   *          matched regex.</li>
    +   *          <li>limit less than or equal to 0: `regex` will be applied as many times as
    +   *          possible, and the resulting array can be of any size.</li>
    --- End diff --
    
    I think we don't need `</li>`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213539767
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,59 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around pattern (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * pattern - a string representing a regular expression. The pattern string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the pattern is applied.
    +
    +        limit > 0:
    +          The resulting array's length will not be more than `limit`, and the resulting array's
    +          last entry will contain all input beyond the last matched pattern.
    +
    +        limit < 0:
    +          `pattern` will be applied as many times as possible, and the resulting
    +          array can be of any size.
    +
    +        limit = 0:
    +          `pattern` will be applied as many times as possible, the resulting array can
    +          be of any size, and trailing empty strings will be discarded.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0);
    --- End diff --
    
    drop `|`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96112/testReport)** for PR 22227 at commit [`5c8f487`](https://github.com/apache/spark/commit/5c8f48715748bdeda703761fba6a4d1828a19985).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216247365
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,24 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the size of the returned array. If `limit` is positive,
    +#'        size of the array will be at most `limit`. If `limit` is negative, the
    --- End diff --
    
    will do


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95872/testReport)** for PR 22227 at commit [`b12ee88`](https://github.com/apache/spark/commit/b12ee881c1d9025644d075a6124f9e6e465a6378).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Seems okay


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    hey @HyukjinKwon, I'm seeing two tests fail in the most recent run, but traces don't seem related to any of my changes -- not to mention the test name is `(It is not a test it is a sbt.testing.SuiteSelector)`. Is this something I need to worry about?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    @HyukjinKwon are things passing?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212986332
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
    --- End diff --
    
    @viirya i'll take a crack at it -- the usage is a bit funky given the different behavior based on what `limit` is, I aired on the side of verbose


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95311/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213061265
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
            ["one","two","three",""]
    +|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    + |       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, pattern: Expression, limit: Expression)
    --- End diff --
    
    @maropu which tests use `string-functions.sql`? would like to add tests here but not sure how to explicitly kick off the test as there are no `*Suites` which use this file it seems.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213539793
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,59 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around pattern (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * pattern - a string representing a regular expression. The pattern string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the pattern is applied.
    +
    +        limit > 0:
    +          The resulting array's length will not be more than `limit`, and the resulting array's
    +          last entry will contain all input beyond the last matched pattern.
    +
    +        limit < 0:
    +          `pattern` will be applied as many times as possible, and the resulting
    +          array can be of any size.
    +
    +        limit = 0:
    +          `pattern` will be applied as many times as possible, the resulting array can
    +          be of any size, and trailing empty strings will be discarded.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0);
    +       ["one","two","three"]
    +|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213538010
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,59 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around pattern (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * pattern - a string representing a regular expression. The pattern string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the pattern is applied.
    +
    +        limit > 0:
    +          The resulting array's length will not be more than `limit`, and the resulting array's
    +          last entry will contain all input beyond the last matched pattern.
    +
    +        limit < 0:
    +          `pattern` will be applied as many times as possible, and the resulting
    +          array can be of any size.
    +
    +        limit = 0:
    +          `pattern` will be applied as many times as possible, the resulting array can
    +          be of any size, and trailing empty strings will be discarded.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0);
    +       ["one","two","three"]
    +|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    +       ["one","twoBthreeC"]
    --- End diff --
    
    Add the netative case?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95494/testReport)** for PR 22227 at commit [`fa128db`](https://github.com/apache/spark/commit/fa128db516be00a145af58a595143d7ccc31b7a4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214563339
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3410,13 +3410,15 @@ setMethod("collect_set",
     #' \dontrun{
     #' head(select(df, split_string(df$Sex, "a")))
     #' head(select(df, split_string(df$Class, "\\d")))
    +#' head(select(df, split_string(df$Class, "\\d", 2)))
    --- End diff --
    
    We should add documentation for R side too. Please document `limit` here 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214491144
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,74 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
    +       ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    +       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, regex: Expression, limit: Expression)
    +  extends TernaryExpression with ImplicitCastInputTypes {
     
    -  override def left: Expression = str
    -  override def right: Expression = pattern
       override def dataType: DataType = ArrayType(StringType)
    -  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
    +  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
    +  override def children: Seq[Expression] = str :: regex :: limit :: Nil
    +
    +  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
     
    -  override def nullSafeEval(string: Any, regex: Any): Any = {
    -    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
    +  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
    +    val strings = string.asInstanceOf[UTF8String].split(
    +      regex.asInstanceOf[UTF8String], maybeFallbackLimitValue(limit.asInstanceOf[Int]))
         new GenericArrayData(strings.asInstanceOf[Array[Any]])
       }
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         val arrayClass = classOf[GenericArrayData].getName
    -    nullSafeCodeGen(ctx, ev, (str, pattern) =>
    +    nullSafeCodeGen(ctx, ev, (str, regex, limit) => {
           // Array in java is covariant, so we don't need to cast UTF8String[] to Object[].
    -      s"""${ev.value} = new $arrayClass($str.split($pattern, -1));""")
    +      s"""${ev.value} = new $arrayClass($str.split(
    +         $regex,${handleCodeGenLimitFallback(limit)}));""".stripMargin
    +    })
       }
     
       override def prettyName: String = "split"
    +
    +  /**
    +   * Java String's split method supports "ignore empty string" behavior when the limit is 0.
    +   * To avoid this, we fall back to -1 when the limit is 0. Otherwise, this is a noop.
    +   */
    +  def maybeFallbackLimitValue(limit: Int): Int = {
    --- End diff --
    
    To make all the split behaviour consistent in Spark, how about trying to fix code in `UTF8String.split` directly? (We need to check if the change does not break the existing behaviour, too) cc: @ueshin @HyukjinKwon 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96075/testReport)** for PR 22227 at commit [`b5994ad`](https://github.com/apache/spark/commit/b5994ad66c355cf104c1bd4b1e75ae2e6cc199c3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216122621
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,24 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the size of the returned array. If `limit` is positive,
    +#'        size of the array will be at most `limit`. If `limit` is negative, the
    --- End diff --
    
    you can't use backtick in R doc


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213519898
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2554,7 +2554,27 @@ object functions {
        * @since 1.5.0
        */
       def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +    StringSplit(str.expr, Literal(pattern), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around pattern (pattern is a regular expression).
    +   *
    +   * The limit parameter controls the number of times the pattern is applied and therefore
    +   * affects the length of the resulting array. If the limit n is greater than zero then the
    +   * pattern will be applied at most n - 1 times, the array's length will be no greater than
    +   * n, and the array's last entry will contain all input beyond the last matched delimiter.
    +   * If n is non-positive then the pattern will be applied as many times as possible and the
    +   * array can have any length. If n is zero then the pattern will be applied as many times as
    +   * possible, the array can have any length, and trailing empty strings will be discarded.
    --- End diff --
    
    Can you copy SQL's doc here? You could describe them via `@param` here as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95872/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by dilipbiswal <gi...@git.apache.org>.
Github user dilipbiswal commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96021/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562895
  
    --- Diff: sql/core/src/test/resources/sql-tests/inputs/string-functions.sql ---
    @@ -46,4 +46,10 @@ FROM (
         encode(string(id + 2), 'utf-8') col3,
         encode(string(id + 3), 'utf-8') col4
       FROM range(10)
    -)
    +);
    +
    +-- split function
    +select split('aa1cc2ee3', '[1-9]+');
    +select split('aa1cc2ee3', '[1-9]+', -1);
    +select split('aa1cc2ee3', '[1-9]+', 0);
    +select split('aa1cc2ee3', '[1-9]+', 2);
    --- End diff --
    
    Likewise, I don't think we should add test cases for all cases. Just one test case and check if they are called fine.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95558/testReport)** for PR 22227 at commit [`7e4ba98`](https://github.com/apache/spark/commit/7e4ba981ef4636b5663f1a50df6e3fa8886186c3).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96030/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r215630920
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    --- End diff --
    
    by having it as `regex`, the documentation will require less explanation. @HyukjinKwon if you are ok with keeping it as `regex` I think we should prefer to keep this change.
    
    Happy to revert as well of course


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95511/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95493/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212977986
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
            ["one","two","three",""]
    +|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    + |       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, pattern: Expression, limit: Expression)
    +  extends TernaryExpression with ImplicitCastInputTypes {
     
    -  override def left: Expression = str
    -  override def right: Expression = pattern
       override def dataType: DataType = ArrayType(StringType)
    -  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
    +  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
    +  override def children: Seq[Expression] = str :: pattern :: limit :: Nil
     
    -  override def nullSafeEval(string: Any, regex: Any): Any = {
    -    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
    +  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
    --- End diff --
    
    @viirya the underlying implementation of this method is `Java.lang.String`, correct? This method does allow non-positive values for limit, not sure what Presto is using. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96966/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214566285
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3410,13 +3410,15 @@ setMethod("collect_set",
     #' \dontrun{
     #' head(select(df, split_string(df$Sex, "a")))
     #' head(select(df, split_string(df$Class, "\\d")))
    +#' head(select(df, split_string(df$Class, "\\d", 2)))
    --- End diff --
    
    The current build failure:
    
    ```
    Undocumented arguments in documentation object 'column_string_functions'
      'limit'
    
    Functions with \usage entries need to have the appropriate \alias
    entries, and all their arguments documented.
    The \usage entries must correspond to syntactically valid R code.
    See the chapter 'Writing R documentation files' in the 'Writing R
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212783216
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
            ["one","two","three",""]
    +|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    + |       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, pattern: Expression, limit: Expression)
    --- End diff --
    
    For test coverage, better to add tests in `string-functions.sql`  for the two cases: two arguments and three arguments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214358111
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,36 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    -    """
    -    Splits str around pattern (pattern is a regular expression).
    -
    -    .. note:: pattern is a string represent the regular expression.
    -
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    -    """
    -    sc = SparkContext._active_spark_context
    -    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))
    +def split(str, regex, limit=-1):
    --- End diff --
    
    @HyukjinKwon do you want `regex` -> `pattern` just here in python or every where in this PR?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562760
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    --- End diff --
    
    I would do this:
    
    ```
            * limit > 0: The resulting array's length will not be more than `limit`,
               and the resulting array's last entry will contain all input
               beyond the last matched regex.
            * limit <= 0: `regex` will be applied as many times as possible, and
               the resulting array can be of any size.
    ```
    
    This is just markdown rendered by mkdocs; so probably, it's better to make it consistent with other docs.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95563/testReport)** for PR 22227 at commit [`d80b1a1`](https://github.com/apache/spark/commit/d80b1a15ed8941bad78df2c5f7168a4196d27be4).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96075/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95470/testReport)** for PR 22227 at commit [`8e16328`](https://github.com/apache/spark/commit/8e163283a71fa272ae8d7c43632e38ef01f8d12b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214245703
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,36 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    -    """
    -    Splits str around pattern (pattern is a regular expression).
    -
    -    .. note:: pattern is a string represent the regular expression.
    -
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    -    """
    -    sc = SparkContext._active_spark_context
    -    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))
    +def split(str, regex, limit=-1):
    --- End diff --
    
    Please change `regex ` back to `pattern`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95511/testReport)** for PR 22227 at commit [`a641106`](https://github.com/apache/spark/commit/a6411069c352b30f9094a83991c35f0730b5df55).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212977742
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    --- End diff --
    
    @rxin the underlying implementation of this method is Java.lang.String, correct? This method does allow non-positive values for `limit`, not sure what Presto is using. The text I've put here corresponds with the definition (rather long) from Java.lang.String.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r219690753
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,27 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the length of the returned array.
    +#'              \itemize{
    +#'              \item \code{limit > 0}: length of the array will be at most \code{limit}
    +#'              \item \code{limit <= 0}: the returned array can have any length
    +#'              }
    +#'
     #' @aliases split_string split_string,Column-method
     #' @examples
     #'
     #' \dontrun{
     #' head(select(df, split_string(df$Sex, "a")))
     #' head(select(df, split_string(df$Class, "\\d")))
    +#' head(select(df, split_string(df$Class, "\\d", 2)))
     #' # This is equivalent to the following SQL expression
     #' head(selectExpr(df, "split(Class, '\\\\d')"))}
    --- End diff --
    
    hmm i think L3418 shall be followed by L3420?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95803 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95803/testReport)** for PR 22227 at commit [`4e84df0`](https://github.com/apache/spark/commit/4e84df0a80c1e610068884f937b73478be7e1c1c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96481/testReport)** for PR 22227 at commit [`5c8f487`](https://github.com/apache/spark/commit/5c8f48715748bdeda703761fba6a4d1828a19985).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95237/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96015/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217563572
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1671,18 +1671,32 @@ def repeat(col, n):
     
     @since(1.5)
     @ignore_unicode_prefix
    -def split(str, pattern):
    +def split(str, pattern, limit=-1):
         """
    -    Splits str around pattern (pattern is a regular expression).
    +    Splits str around matches of the given pattern.
     
    -    .. note:: pattern is a string represent the regular expression.
    +    :param str: a string expression to split
    +    :param pattern: a string representing a regular expression. The regex string should be
    +                  a Java regular expression.
    +    :param limit: an integer which controls the number of times `pattern` is applied.
     
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    +            * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    --- End diff --
    
    Let's make it 4 spaced too


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95918/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214135400
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`, and the resulting
    +                   array's last entry will contain all input beyond the last matched regex.
    +
    +        limit < 0: `regex` will be applied as many times as possible, and the resulting
    +                   array can be of any size.
    +
    +        limit = 0: `regex` will be applied as many times as possible, the resulting array can
    --- End diff --
    
    yea but i'd focus on what behavior we want to enable. do other database systems have this split=0 semantics? if not, i'd rewrite split=0 internally to just -1.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r215124064
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    --- End diff --
    
    Yea, I don't think we should change the name in case either makes sense in a way.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214563659
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    --- End diff --
    
    Current bullt doc is:
    
    ![screen shot 2018-09-03 at 10 04 46 am](https://user-images.githubusercontent.com/6477701/44964013-e47e3e00-af60-11e8-975a-41d5dc5f1e6e.png)
    
    which doesn;t look quite nice. Please verify the output by referring https://github.com/apache/spark/tree/master/docs


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95311/testReport)** for PR 22227 at commit [`4e10733`](https://github.com/apache/spark/commit/4e107337a47ce590c703b757b0a44d60d6b862e1).
     * This patch **fails from timeout after a configured wait of \`400m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    @phegstrom Thanks for your contribution!
    Btw, seems like your email address in your commits is not connected to your GitHub account. Could you connect the address to your account, or use other address connected to your account? Otherwise, your contribution will not be connected to you.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96966/testReport)** for PR 22227 at commit [`34ba74f`](https://github.com/apache/spark/commit/34ba74f79aad2a0e2fe9e0d6f6110a10a51c8108).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214926165
  
    --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
    @@ -394,12 +394,14 @@ public void substringSQL() {
     
       @Test
       public void split() {
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), -1),
    -      new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    +    UTF8String[] negativeAndZeroLimitCase =
    +            new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi"), fromString("")};
    +    assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), 0),
    +            negativeAndZeroLimitCase));
    +    assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), -1),
    --- End diff --
    
    @HyukjinKwon the last two were duplicates:
    ```
          new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
        assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
          new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    ```
    
    And I also thought it better to include the case where you do get an empty string (adding one more instance of the regex at the end). Want me to revert? My view is it's more exhaustive of the expected behavior, and also easier to see that limit = -1 should behave exactly like limit = 0.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95486/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95473/testReport)** for PR 22227 at commit [`ca23ea3`](https://github.com/apache/spark/commit/ca23ea363fe2beaecc4cf16385256aabbdd7d626).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95558/testReport)** for PR 22227 at commit [`7e4ba98`](https://github.com/apache/spark/commit/7e4ba981ef4636b5663f1a50df6e3fa8886186c3).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212815703
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    --- End diff --
    
    you should say if limit is ignored if it is a non-positive number.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217563904
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1671,18 +1671,32 @@ def repeat(col, n):
     
     @since(1.5)
     @ignore_unicode_prefix
    -def split(str, pattern):
    +def split(str, pattern, limit=-1):
         """
    -    Splits str around pattern (pattern is a regular expression).
    +    Splits str around matches of the given pattern.
     
    -    .. note:: pattern is a string represent the regular expression.
    +    :param str: a string expression to split
    +    :param pattern: a string representing a regular expression. The regex string should be
    +                  a Java regular expression.
    +    :param limit: an integer which controls the number of times `pattern` is applied.
     
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    +            * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    +                             resulting array's last entry will contain all input beyond the last
    +                             matched pattern.
    +            * ``limit <= 0``: `pattern` will be applied as many times as possible, and the resulting
    +                              array can be of any size.
    +
    +    .. versionchanged:: 3.0
    +       `split` now takes an optional `limit` field. If not provided, default limit value is -1.
    +
    +    >>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
    +    >>> df.select(split(df.s, '[ABC]', 2).alias('s')).collect()
    +    [Row(s=[u'one', u'twoBthreeC'])]
    +    >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect()
    --- End diff --
    
    Let's turn into this an example without limit argument.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95987/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95803 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95803/testReport)** for PR 22227 at commit [`4e84df0`](https://github.com/apache/spark/commit/4e84df0a80c1e610068884f937b73478be7e1c1c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217953294
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1803,6 +1803,18 @@ test_that("string operators", {
         collect(select(df4, split_string(df4$a, "\\\\")))[1, 1],
         list(list("a.b@c.d   1", "b"))
       )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
    +    list(list("a", "b@c.d   1\\b"))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", -2)))[1, 1],
    +    list(list("a.", "@c.d   1\\", ""))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", 0)))[1, 1],
    --- End diff --
    
    for context, we've had some cases in the past the wrong value is passed for an parameter - so let's at least get one with and one without any optional parameter


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    
    retest this please
    --
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214561410
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,33 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    +def split(str, pattern, limit=-1):
         """
    -    Splits str around pattern (pattern is a regular expression).
    +    Splits str around matches of the given pattern.
    +
    +    :param str: a string expression to split
    +    :param pattern: a string representing a regular expression. The regex string should be
    +                  a Java regular expression.
    +    :param limit: an integer expression which controls the number of times the pattern is applied.
     
    -    .. note:: pattern is a string represent the regular expression.
    +            * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    +                             resulting array's last entry will contain all input beyond the last
    +                             matched pattern.
    +            * ``limit <= 0``: `pattern` will be applied as many times as possible, and the resulting
    +                              array can be of any size.
     
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    +    >>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s',])
    +    >>> df.select(split(df.s, '[ABC]', 2).alias('s')).collect()
    +    [Row(s=[u'one', u'twoBthreeC'])]
    +    >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect()
    +    [Row(s=[u'one', u'two', u'three', u''])]
    +    >>> df.select(split(df.s, '[ABC]', 0).alias('s')).collect()
    --- End diff --
    
    I wouldn't have this test since we now don't have a specific behaviour to 0.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95578/testReport)** for PR 22227 at commit [`d17d2df`](https://github.com/apache/spark/commit/d17d2df530a9236471b415700533c9224c94435a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96075 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96075/testReport)** for PR 22227 at commit [`b5994ad`](https://github.com/apache/spark/commit/b5994ad66c355cf104c1bd4b1e75ae2e6cc199c3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95486/testReport)** for PR 22227 at commit [`79599eb`](https://github.com/apache/spark/commit/79599ebc26f089737e101b417428ceb6620f802d).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r218062906
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1803,6 +1803,18 @@ test_that("string operators", {
         collect(select(df4, split_string(df4$a, "\\\\")))[1, 1],
         list(list("a.b@c.d   1", "b"))
       )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
    +    list(list("a", "b@c.d   1\\b"))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", -2)))[1, 1],
    +    list(list("a.", "@c.d   1\\", ""))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", 0)))[1, 1],
    --- End diff --
    
    @felixcheung just to confirm, things look ok here to you? We now have both with/without optional parameter test cases


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
GitHub user phegstrom reopened a pull request:

    https://github.com/apache/spark/pull/22227

    [SPARK-25202] [SQL] Implements split with limit sql function

    ## What changes were proposed in this pull request?
    
    Adds support for the setting limit in the sql split function
    
    ## How was this patch tested?
    
    1. Updated unit tests
    2. Tested using Scala spark shell
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/phegstrom/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22227.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22227
    
----
commit 15362be4764d2b33757488efe38667cc762246ad
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-24T19:19:58Z

    implement split with limit

commit ceb3f41238c8731606164cea5c45a0b87bb5d6f2
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-24T20:26:42Z

    linting

commit e564a68ce8db001f5ebcec49bbeb8a4b893a8d70
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-27T20:27:45Z

    most comments

commit 4e107337a47ce590c703b757b0a44d60d6b862e1
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-27T21:01:14Z

    sql function tests

commit 5135cb28e505f5c61ce325ac6e5589b35450b44a
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-28T21:52:25Z

    fixing test file, comments

commit e8c8c8c5e3811fd8e8f2c7b8c77ae404b4acc157
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-28T21:57:24Z

    adding another example

commit 8e163283a71fa272ae8d7c43632e38ef01f8d12b
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-30T13:57:53Z

    better docs, fixing sql output files

commit ca23ea363fe2beaecc4cf16385256aabbdd7d626
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-30T15:01:36Z

    adding python support

commit 96bc875d4790f46f72e30c4845c25ed5c4dc73cc
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-30T15:22:17Z

    fixing style checks

commit 79599ebc26f089737e101b417428ceb6620f802d
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-30T18:46:01Z

    adding support for R

commit a27c848a62a08c869699bd521092987672337926
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-30T21:07:28Z

    ammending doc format

commit fa128db516be00a145af58a595143d7ccc31b7a4
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-30T21:13:06Z

    fixing typo

commit a6411069c352b30f9094a83991c35f0730b5df55
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-31T01:58:07Z

    fix docs to check ci status

commit 7e4ba981ef4636b5663f1a50df6e3fa8886186c3
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-31T17:44:09Z

    add fallback from limit = 0, update tests

commit d80b1a15ed8941bad78df2c5f7168a4196d27be4
Author: Parker Hegstrom <ph...@...>
Date:   2018-08-31T21:19:11Z

    fix python syntax in tests

commit d17d2df530a9236471b415700533c9224c94435a
Author: Parker Hegstrom <ph...@...>
Date:   2018-09-01T15:39:04Z

    bring limit handling into UTF8String split impl

commit 64b0afca802c4557b5a53aa62b7486c3d8d4fe8c
Author: Parker Hegstrom <ph...@...>
Date:   2018-09-06T19:57:37Z

    HyukjinKwon comments

commit 4e84df0a80c1e610068884f937b73478be7e1c1c
Author: Parker Hegstrom <ph...@...>
Date:   2018-09-07T16:16:17Z

    removing duplicate pattern reference

commit b12ee881c1d9025644d075a6124f9e6e465a6378
Author: Parker Hegstrom <ph...@...>
Date:   2018-09-10T10:04:28Z

    docs comments, change to 3.0

commit 69d219018c9b2a7f3fb7dd716619f067aec0d2dc
Author: Parker Hegstrom <ph...@...>
Date:   2018-09-12T11:58:42Z

    docs comments

commit b5994ad66c355cf104c1bd4b1e75ae2e6cc199c3
Author: Parker Hegstrom <ph...@...>
Date:   2018-09-14T10:55:55Z

    docs comments, should be good to go

commit 5c8f48715748bdeda703761fba6a4d1828a19985
Author: Parker Hegstrom <ph...@...>
Date:   2018-09-16T22:46:43Z

    felix comments for R tests

commit 34ba74f79aad2a0e2fe9e0d6f6110a10a51c8108
Author: Parker Hegstrom <ph...@...>
Date:   2018-10-01T18:18:06Z

    viirya comments

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95918/testReport)** for PR 22227 at commit [`b12ee88`](https://github.com/apache/spark/commit/b12ee881c1d9025644d075a6124f9e6e465a6378).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    long thread, are we all good with this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95473/testReport)** for PR 22227 at commit [`ca23ea3`](https://github.com/apache/spark/commit/ca23ea363fe2beaecc4cf16385256aabbdd7d626).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96021/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96481/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96981/testReport)** for PR 22227 at commit [`34ba74f`](https://github.com/apache/spark/commit/34ba74f79aad2a0e2fe9e0d6f6110a10a51c8108).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95558/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96481/testReport)** for PR 22227 at commit [`5c8f487`](https://github.com/apache/spark/commit/5c8f48715748bdeda703761fba6a4d1828a19985).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95388/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95476/testReport)** for PR 22227 at commit [`96bc875`](https://github.com/apache/spark/commit/96bc875d4790f46f72e30c4845c25ed5c4dc73cc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r215705304
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    --- End diff --
    
    got it, nvm


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96030/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214244918
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3410,13 +3410,14 @@ setMethod("collect_set",
     #' \dontrun{
     #' head(select(df, split_string(df$Sex, "a")))
     #' head(select(df, split_string(df$Class, "\\d")))
    +#' head(select(df, split_string(df$Class, "\\d", 2)))
     #' # This is equivalent to the following SQL expression
     #' head(selectExpr(df, "split(Class, '\\\\d')"))}
     #' @note split_string 2.3.0
     setMethod("split_string",
               signature(x = "Column", pattern = "character"),
    -          function(x, pattern) {
    -            jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern)
    +          function(x, pattern, limit = -1) {
    +            jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern, limit)
    --- End diff --
    
    you should have `as.integer(limit)` instead
    could we add a test in R?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    @phegstrom, can you close and reopen this PR to retrigger the AppVeyor test above? Looks it's failed due to time limitation.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r221706055
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,27 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the length of the returned array.
    --- End diff --
    
    going to include this in the `@details` section, as other functions like `ltrim` handle optionality of one of its params there.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214134851
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`, and the resulting
    +                   array's last entry will contain all input beyond the last matched regex.
    +
    +        limit < 0: `regex` will be applied as many times as possible, and the resulting
    +                   array can be of any size.
    +
    +        limit = 0: `regex` will be applied as many times as possible, the resulting array can
    --- End diff --
    
    @rxin this is default behavior for java lang string's `split` implementation. https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String,%20int), nothing has been specifically built into it here.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r221641267
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,27 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the length of the returned array.
    +#'              \itemize{
    +#'              \item \code{limit > 0}: length of the array will be at most \code{limit}
    +#'              \item \code{limit <= 0}: the returned array can have any length
    +#'              }
    +#'
     #' @aliases split_string split_string,Column-method
     #' @examples
     #'
     #' \dontrun{
     #' head(select(df, split_string(df$Sex, "a")))
     #' head(select(df, split_string(df$Class, "\\d")))
    +#' head(select(df, split_string(df$Class, "\\d", 2)))
     #' # This is equivalent to the following SQL expression
     #' head(selectExpr(df, "split(Class, '\\\\d')"))}
    --- End diff --
    
    yes will make that change @viirya @felixcheung 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95237/testReport)** for PR 22227 at commit [`ceb3f41`](https://github.com/apache/spark/commit/ceb3f41238c8731606164cea5c45a0b87bb5d6f2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212783356
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
    --- End diff --
    
    hmm, is it possible to make this usage more compact? I think the usage here should be concise.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96981/testReport)** for PR 22227 at commit [`34ba74f`](https://github.com/apache/spark/commit/34ba74f79aad2a0e2fe9e0d6f6110a10a51c8108).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95563/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Can you add tests in `StringFunctionsSuite`, too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213154483
  
    --- Diff: sql/core/src/test/resources/sql-tests/inputs/string-functions.sql ---
    @@ -5,6 +5,10 @@ select format_string();
     -- A pipe operator for string concatenation
     select 'a' || 'b' || 'c';
     
    +-- split function
    +select split('aa1cc2ee', '[1-9]+', 2);
    +select split('aa1cc2ee', '[1-9]+');
    +
    --- End diff --
    
    Can you move these tests to the end of this file in order to decrease unnecessary changes in the golden file.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214244981
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,36 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    -    """
    -    Splits str around pattern (pattern is a regular expression).
    -
    -    .. note:: pattern is a string represent the regular expression.
    -
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    -    """
    -    sc = SparkContext._active_spark_context
    -    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))
    +def split(str, regex, limit=-1):
    --- End diff --
    
    this would be a breaking API change I believe for python


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562525
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
    +       ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    +       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, regex: Expression, limit: Expression)
    +  extends TernaryExpression with ImplicitCastInputTypes {
     
    -  override def left: Expression = str
    -  override def right: Expression = pattern
       override def dataType: DataType = ArrayType(StringType)
    -  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
    +  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
    +  override def children: Seq[Expression] = str :: regex :: limit :: Nil
    +
    +  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
     
    -  override def nullSafeEval(string: Any, regex: Any): Any = {
    -    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
    +  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
    +    val strings = string.asInstanceOf[UTF8String].split(
    +      regex.asInstanceOf[UTF8String], limit.asInstanceOf[Int])
         new GenericArrayData(strings.asInstanceOf[Array[Any]])
       }
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         val arrayClass = classOf[GenericArrayData].getName
    -    nullSafeCodeGen(ctx, ev, (str, pattern) =>
    +    nullSafeCodeGen(ctx, ev, (str, regex, limit) => {
           // Array in java is covariant, so we don't need to cast UTF8String[] to Object[].
    -      s"""${ev.value} = new $arrayClass($str.split($pattern, -1));""")
    +      s"""${ev.value} = new $arrayClass($str.split($regex,$limit));""".stripMargin
    +    })
       }
     
       override def prettyName: String = "split"
    +
    --- End diff --
    
    Not a big deal but let's revert unrelated newline change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216185688
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,24 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the size of the returned array. If `limit` is positive,
    +#'        size of the array will be at most `limit`. If `limit` is negative, the
    --- End diff --
    
    Let's also match the doc. You can refer https://github.com/apache/spark/blob/4e84df0a80c1e610068884f937b73478be7e1c1c/R/pkg/R/functions.R#L186-L191
    
    for list


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216185526
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *        <li>limit greater than 0: The resulting array's length will not be more than `limit`,
    +   *                              and the resulting array's last entry will contain all input
    +   *                              beyond the last matched regex.</li>
    +   *        <li>limit less than or equal to 0: `regex` will be applied as many times as possible,
    +   *                                           and the resulting array can be of any size.</li>
    +   *        </ul>
    +   *
    +   * @group string_funcs
    +   * @since 2.4.0
    --- End diff --
    
    ditto for 3.0.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95494/testReport)** for PR 22227 at commit [`fa128db`](https://github.com/apache/spark/commit/fa128db516be00a145af58a595143d7ccc31b7a4).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r215682813
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    --- End diff --
    
    @HyukjinKwon sorry, but where can I get the markdown output of the `regexExpressions.scala`? When I run unidoc, I'm only seeing the html corresponding to python, java, r (and these are from the functions, not from `regexExpressions.scala`.
    
    I'm running 
    ```
    build/sbt unidoc
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Please remove the 0 semantics. IMO the zero vs negative number difference is too subtle. I only find Java String supporting that. Python doesn't. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214131195
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`, and the resulting
    +                   array's last entry will contain all input beyond the last matched regex.
    +
    +        limit < 0: `regex` will be applied as many times as possible, and the resulting
    +                   array can be of any size.
    +
    +        limit = 0: `regex` will be applied as many times as possible, the resulting array can
    --- End diff --
    
    why do we want to build into split to remove trailing empty strings?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r221641565
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0: The resulting array's length will not be more than limit,
    +   *          and the resulting array's last entry will contain all input beyond the last
    +   *          matched regex.</li>
    +   *          <li>limit less than or equal to 0: `regex` will be applied as many times as
    +   *          possible, and the resulting array can be of any size.</li>
    --- End diff --
    
    I was asked to do `<li>` earlier in this PR conversation. @HyukjinKwon -- thoughts here? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96981/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95476/testReport)** for PR 22227 at commit [`96bc875`](https://github.com/apache/spark/commit/96bc875d4790f46f72e30c4845c25ed5c4dc73cc).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95470/testReport)** for PR 22227 at commit [`8e16328`](https://github.com/apache/spark/commit/8e163283a71fa272ae8d7c43632e38ef01f8d12b).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class StringSplit(str: Expression, regex: Expression, limit: Expression)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95511/testReport)** for PR 22227 at commit [`a641106`](https://github.com/apache/spark/commit/a6411069c352b30f9094a83991c35f0730b5df55).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217901635
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1803,6 +1803,10 @@ test_that("string operators", {
         collect(select(df4, split_string(df4$a, "\\\\")))[1, 1],
         list(list("a.b@c.d   1", "b"))
       )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
    +    list(list("a", "b@c.d   1\\b"))
    --- End diff --
    
    let's add a test for `limit = 0` or `limit = -1` too - while it's the default value, is any of the test cases changes behavior for limit = -1?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214358771
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,37 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *    limit greater than 0: The resulting array's length will not be more than `limit`,
    +   *                          and the resulting array's last entry will contain all input beyond
    +   *                          the last matched regex.
    +   *    limit less than or equal to 0: `regex` will be applied as many times as possible, and
    +   *                       the resulting array can be of any size.
    --- End diff --
    
    will check that -- do you know how I can get around the fact that unidoc throws an error whenever there are logical operators like `<` or `>` in the docs? Having these would be cleaner in general
    ```
    > build/sbt unidoc
    ...
    [error] ./spark/sql/core/target/java/org/apache/spark/sql/functions.java:2834: error: malformed HTML
    [error]    *    limit < 0: <code>regex</code> will be applied as many times as possible, and the resulting
    [error]               ^



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212782550
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
            ["one","two","three",""]
    +|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    + |       ["one","twoBthreeC"]
    --- End diff --
    
    nit: remove `|`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r215123978
  
    --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
    @@ -394,12 +394,14 @@ public void substringSQL() {
     
       @Test
       public void split() {
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), -1),
    -      new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    +    UTF8String[] negativeAndZeroLimitCase =
    +            new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi"), fromString("")};
    +    assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), 0),
    +            negativeAndZeroLimitCase));
    +    assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), -1),
    --- End diff --
    
    Let's fix the indentation to show less diff.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95237/testReport)** for PR 22227 at commit [`ceb3f41`](https://github.com/apache/spark/commit/ceb3f41238c8731606164cea5c45a0b87bb5d6f2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r219690825
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,27 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the length of the returned array.
    --- End diff --
    
    shall we mention this is an optional param?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217000102
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,51 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *          <li>limit greater than 0
    +   *            <ul>
    +   *              <li>
    +   *                The resulting array's length will not be more than limit,
    +   *                and the resulting array's last entry will contain all input
    +   *                beyond the last matched regex.
    +   *             </li>
    +   *            </ul>
    +   *          </li>
    +   *          <li>limit less than or equal to 0
    +   *            <ul>
    +   *              <li>
    +   *                `regex` will be applied as many times as possible,
    +   *                and the resulting array can be of any size.
    +   *              </li>
    +   *            </ul>
    +   *          </li>
    +   *        </ul>
    --- End diff --
    
    oh I thought you wanted to have the explanations as sub bullets, will make that change @HyukjinKwon 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r217930893
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1803,6 +1803,10 @@ test_that("string operators", {
         collect(select(df4, split_string(df4$a, "\\\\")))[1, 1],
         list(list("a.b@c.d   1", "b"))
       )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
    +    list(list("a", "b@c.d   1\\b"))
    --- End diff --
    
    added for `limit = 0` to catch the "change behavior" case


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95918/testReport)** for PR 22227 at commit [`b12ee88`](https://github.com/apache/spark/commit/b12ee881c1d9025644d075a6124f9e6e465a6378).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95473/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214529571
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,36 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    -    """
    -    Splits str around pattern (pattern is a regular expression).
    -
    -    .. note:: pattern is a string represent the regular expression.
    -
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    -    """
    -    sc = SparkContext._active_spark_context
    -    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))
    +def split(str, regex, limit=-1):
    --- End diff --
    
    yes, `regex` is the part breaking..


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95803/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562340
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <p><p>
    +   *        limit greater than 0: The resulting array's length will not be more than `limit`,
    +   *                              and the resulting array's last entry will contain all input beyond
    +   *                              the last matched regex.
    +   *        <p><p>
    +   *        limit less than or equal to 0: `regex` will be applied as many times as possible, and
    +   *                                       the resulting array can be of any size.
    --- End diff --
    
    I think you can refer https://github.com/apache/spark/blob/e754887182304ad0d622754e33192ebcdd515965/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L338-L386


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95907/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213913244
  
    --- Diff: sql/core/src/test/resources/sql-tests/inputs/string-functions.sql ---
    @@ -47,3 +47,7 @@ FROM (
         encode(string(id + 3), 'utf-8') col4
       FROM range(10)
     )
    --- End diff --
    
    The result is apparently wrong. Maybe we need `;` here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95470/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96826/testReport)** for PR 22227 at commit [`34ba74f`](https://github.com/apache/spark/commit/34ba74f79aad2a0e2fe9e0d6f6110a10a51c8108).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    ok to test.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212783068
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2554,7 +2554,27 @@ object functions {
        * @since 1.5.0
        */
       def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +    StringSplit(str.expr, lit(pattern).expr, lit(-1).expr)
    +  }
    +
    +  /**
    +   * Splits str around pattern (pattern is a regular expression) up to `limit-1` times.
    +   *
    +   * The limit parameter controls the number of times the pattern is applied and therefore
    +   * affects the length of the resulting array. If the limit n is greater than zero then the
    +   * pattern will be applied at most n - 1 times, the array's length will be no greater than
    +   * n, and the array's last entry will contain all input beyond the last matched delimiter.
    +   * If n is non-positive then the pattern will be applied as many times as possible and the
    +   * array can have any length. If n is zero then the pattern will be applied as many times as
    +   * possible, the array can have any length, and trailing empty strings will be discarded.
    +   *
    +   * @note Pattern is a string representation of the regular expression.
    +   *
    +   * @group string_funcs
    +   * @since 1.5.0
    +   */
    +  def split(str: Column, pattern: String, limit: Int): Column = withExpr {
    +    StringSplit(str.expr, lit(pattern).expr, lit(limit).expr)
    --- End diff --
    
    nit: better to directly use `Literal`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    @maropu @rxin made those changes to fallback to -1 if limit = 0, let me know what you think.
    
    @felixcheung updated R tests as well


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214561362
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    --- End diff --
    
    Shall we just keep it as `pattern`? I think we don't better change the name. Doesn;t `pattern` also make sense?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95774 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95774/testReport)** for PR 22227 at commit [`64b0afc`](https://github.com/apache/spark/commit/64b0afca802c4557b5a53aa62b7486c3d8d4fe8c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22227


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96030/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95907/testReport)** for PR 22227 at commit [`b12ee88`](https://github.com/apache/spark/commit/b12ee881c1d9025644d075a6124f9e6e465a6378).
     * This patch **fails from timeout after a configured wait of \`400m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216185422
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1671,18 +1671,32 @@ def repeat(col, n):
     
     @since(1.5)
     @ignore_unicode_prefix
    -def split(str, pattern):
    +def split(str, pattern, limit=-1):
         """
    -    Splits str around pattern (pattern is a regular expression).
    +    Splits str around matches of the given pattern.
     
    -    .. note:: pattern is a string represent the regular expression.
    +    :param str: a string expression to split
    +    :param pattern: a string representing a regular expression. The regex string should be
    +                  a Java regular expression.
    +    :param limit: an integer which controls the number of times `pattern` is applied.
     
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    +            * ``limit > 0``: The resulting array's length will not be more than `limit`, and the
    +                             resulting array's last entry will contain all input beyond the last
    +                             matched pattern.
    +            * ``limit <= 0``: `pattern` will be applied as many times as possible, and the resulting
    +                              array can be of any size.
    +
    +    .. versionchanged:: 2.4
    --- End diff --
    
    Let's target 3.0 since branch-2.4 is cut out.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95872/testReport)** for PR 22227 at commit [`b12ee88`](https://github.com/apache/spark/commit/b12ee881c1d9025644d075a6124f9e6e465a6378).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214379104
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3410,13 +3410,14 @@ setMethod("collect_set",
     #' \dontrun{
     #' head(select(df, split_string(df$Sex, "a")))
     #' head(select(df, split_string(df$Class, "\\d")))
    +#' head(select(df, split_string(df$Class, "\\d", 2)))
     #' # This is equivalent to the following SQL expression
     #' head(selectExpr(df, "split(Class, '\\\\d')"))}
     #' @note split_string 2.3.0
     setMethod("split_string",
               signature(x = "Column", pattern = "character"),
    -          function(x, pattern) {
    -            jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern)
    +          function(x, pattern, limit = -1) {
    +            jc <- callJStatic("org.apache.spark.sql.functions", "split", x@jc, pattern, limit)
    --- End diff --
    
    @felixcheung what's the best way to run a single unit test group with testthat in this repo? spark docs only point to ./run-tests.sh which runs all R unit tests. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562034
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,33 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
    --- End diff --
    
    I wouldn't change `since`. You can describe the behaviour changed by, for instance:
    
    ```python
            .. versionchanged:: 2.4
               The ``limit`` parameter blah blah..
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96112/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Will do @rxin. Question -- does this need to happen in `nullSafeEval()` in the StringSplit class? Unsure where the 0 -> -1 logic should live.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95311/testReport)** for PR 22227 at commit [`4e10733`](https://github.com/apache/spark/commit/4e107337a47ce590c703b757b0a44d60d6b862e1).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216185520
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    +    StringSplit(str.expr, Literal(regex), Literal(-1))
    +  }
    +
    +  /**
    +   * Splits str around matches of the given regex.
    +   *
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
    +   * @param limit an integer expression which controls the number of times the regex is applied.
    +   *        <ul>
    +   *        <li>limit greater than 0: The resulting array's length will not be more than `limit`,
    +   *                              and the resulting array's last entry will contain all input
    +   *                              beyond the last matched regex.</li>
    +   *        <li>limit less than or equal to 0: `regex` will be applied as many times as possible,
    +   *                                           and the resulting array can be of any size.</li>
    +   *        </ul>
    --- End diff --
    
    Let's use the same way to make it multiple lines with https://github.com/apache/spark/blob/e754887182304ad0d622754e33192ebcdd515965/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L338-L386


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212782576
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2554,7 +2554,27 @@ object functions {
        * @since 1.5.0
        */
       def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +    StringSplit(str.expr, lit(pattern).expr, lit(-1).expr)
    +  }
    +
    +  /**
    +   * Splits str around pattern (pattern is a regular expression) up to `limit-1` times.
    --- End diff --
    
    Drop `up to `limit-1` times` in the first line? That's because the behaviour depends on values described below.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214526544
  
    --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
    @@ -394,12 +394,14 @@ public void substringSQL() {
     
       @Test
       public void split() {
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), -1),
    -      new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    +    UTF8String[] negativeAndZeroLimitCase =
    +            new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi"), fromString("")};
    --- End diff --
    
    nit: indent


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212815685
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    --- End diff --
    
    can we be more concise? e.g. presto's doc is just
    
    "Splits string on delimiter and returns an array of size at most limit. The last element in the array always contain everything left in the string. limit must be a positive number."


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    @gatorsmile @ueshin can you trigger this test?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #96021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96021/testReport)** for PR 22227 at commit [`69d2190`](https://github.com/apache/spark/commit/69d219018c9b2a7f3fb7dd716619f067aec0d2dc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562493
  
    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -952,6 +952,11 @@ public static UTF8String concatWs(UTF8String separator, UTF8String... inputs) {
       }
     
       public UTF8String[] split(UTF8String pattern, int limit) {
    +    // Java String's split method supports "ignore empty string" behavior when the limit is 0.
    +    // To avoid this, we fall back to -1 when the limit is 0.
    --- End diff --
    
    I also would leave a short justification for this given https://github.com/apache/spark/pull/22227#issuecomment-417471241


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r216941747
  
    --- Diff: R/pkg/R/functions.R ---
    @@ -3404,19 +3404,27 @@ setMethod("collect_set",
     #' Equivalent to \code{split} SQL function.
     #'
     #' @rdname column_string_functions
    +#' @param limit determines the length of the returned array.
    +#'          \itemize{
    +#'          \item \code{limit > 0}: length of the array will be at most \code{limit}
    +#'          \item \code{limit <= 0}: the returned array can have any length
    +#'          }
    --- End diff --
    
    To make it consistent with the reference I gave:
    
    ```
    #' @param limit determines the length of the returned array.
    #'              \itemize{
    #'              \item \code{limit > 0}: length of the array will be at most \code{limit}
    #'              \item \code{limit <= 0}: the returned array can have any length
    #'              }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212783375
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
    --- End diff --
    
    I think it is better to keep original example for default value.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562374
  
    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -952,6 +952,11 @@ public static UTF8String concatWs(UTF8String separator, UTF8String... inputs) {
       }
     
       public UTF8String[] split(UTF8String pattern, int limit) {
    +    // Java String's split method supports "ignore empty string" behavior when the limit is 0.
    +    // To avoid this, we fall back to -1 when the limit is 0.
    --- End diff --
    
    Could you narrow down why we avoid the -1 case for others? thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212781563
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2554,7 +2554,27 @@ object functions {
        * @since 1.5.0
        */
       def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +    StringSplit(str.expr, lit(pattern).expr, lit(-1).expr)
    +  }
    +
    +  /**
    +   * Splits str around pattern (pattern is a regular expression) up to `limit-1` times.
    +   *
    +   * The limit parameter controls the number of times the pattern is applied and therefore
    +   * affects the length of the resulting array. If the limit n is greater than zero then the
    +   * pattern will be applied at most n - 1 times, the array's length will be no greater than
    +   * n, and the array's last entry will contain all input beyond the last matched delimiter.
    +   * If n is non-positive then the pattern will be applied as many times as possible and the
    +   * array can have any length. If n is zero then the pattern will be applied as many times as
    +   * possible, the array can have any length, and trailing empty strings will be discarded.
    +   *
    +   * @note Pattern is a string representation of the regular expression.
    +   *
    +   * @group string_funcs
    +   * @since 1.5.0
    --- End diff --
    
    `1.5.0` -> `2.4.0`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213538271
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,59 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around pattern (pattern is a regular expression).
    --- End diff --
    
    pattern? regex? we should use a consisntent word.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212782616
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
            ["one","two","three",""]
    +|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    + |       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, pattern: Expression, limit: Expression)
    --- End diff --
    
    We still need to support 2 arguments. Please add a constructor `def this(str: Expression, pattern: Expression)`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95987/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95578/testReport)** for PR 22227 at commit [`d17d2df`](https://github.com/apache/spark/commit/d17d2df530a9236471b415700533c9224c94435a).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [Core] Implements split with limit ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r212783481
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,41 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied and " +
    +    "therefore affects the length of the resulting array. If the limit n is " +
    +    "greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "non-positive then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
       examples = """
         Examples:
    -      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
            ["one","two","three",""]
    +|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    + |       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, pattern: Expression, limit: Expression)
    +  extends TernaryExpression with ImplicitCastInputTypes {
     
    -  override def left: Expression = str
    -  override def right: Expression = pattern
       override def dataType: DataType = ArrayType(StringType)
    -  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
    +  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
    +  override def children: Seq[Expression] = str :: pattern :: limit :: Nil
     
    -  override def nullSafeEval(string: Any, regex: Any): Any = {
    -    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
    +  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
    --- End diff --
    
    I think we still need to do some check on `limit`. According to Presto document, `limit` must be a positive number. -1 is only used when no `limit` parameter is given (default value).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214930943
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -2546,15 +2546,39 @@ object functions {
       def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
     
       /**
    -   * Splits str around pattern (pattern is a regular expression).
    +   * Splits str around matches of the given regex.
        *
    -   * @note Pattern is a string representation of the regular expression.
    +   * @param str a string expression to split
    +   * @param regex a string representing a regular expression. The regex string should be
    +   *              a Java regular expression.
        *
        * @group string_funcs
        * @since 1.5.0
        */
    -  def split(str: Column, pattern: String): Column = withExpr {
    -    StringSplit(str.expr, lit(pattern).expr)
    +  def split(str: Column, regex: String): Column = withExpr {
    --- End diff --
    
    The reason I changed it is that every time we mentioned `pattern` in the comments/docs, we always added a phrase like "pattern, which is a regular expression ..."
    
    just felt like unnecessary explanation needed if we called the variable `regex`. Happy to change if you think necessary though!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95388/testReport)** for PR 22227 at commit [`e8c8c8c`](https://github.com/apache/spark/commit/e8c8c8c5e3811fd8e8f2c7b8c77ae404b4acc157).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    hey @HyukjinKwon , the SparkR test now failing is 
    ```
    Duplicated \argument entries in documentation object 'column_string_functions':
      'pattern'
    
    Functions with \usage entries need to have the appropriate \alias
    entries, and all their arguments documented.
    The \usage entries must correspond to syntactically valid R code.
    See the chapter 'Writing R documentation files' in the 'Writing R
    ```
    Can I not have `@param pattern` twice in the docs for the rdname `column_string_functions`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214516705
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,36 +229,74 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`,
    +                   and the resulting array's last entry will contain all input
    +                   beyond the last matched regex.
    +        limit <= 0: `regex` will be applied as many times as possible, and
    +                    the resulting array can be of any size.
    +  """,
       examples = """
         Examples:
           > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
            ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
    +       ["one","two","three",""]
    +      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
    +       ["one","twoBthreeC"]
       """)
    -case class StringSplit(str: Expression, pattern: Expression)
    -  extends BinaryExpression with ImplicitCastInputTypes {
    +case class StringSplit(str: Expression, regex: Expression, limit: Expression)
    +  extends TernaryExpression with ImplicitCastInputTypes {
     
    -  override def left: Expression = str
    -  override def right: Expression = pattern
       override def dataType: DataType = ArrayType(StringType)
    -  override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
    +  override def inputTypes: Seq[DataType] = Seq(StringType, StringType, IntegerType)
    +  override def children: Seq[Expression] = str :: regex :: limit :: Nil
    +
    +  def this(exp: Expression, regex: Expression) = this(exp, regex, Literal(-1));
     
    -  override def nullSafeEval(string: Any, regex: Any): Any = {
    -    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
    +  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
    +    val strings = string.asInstanceOf[UTF8String].split(
    +      regex.asInstanceOf[UTF8String], maybeFallbackLimitValue(limit.asInstanceOf[Int]))
         new GenericArrayData(strings.asInstanceOf[Array[Any]])
       }
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         val arrayClass = classOf[GenericArrayData].getName
    -    nullSafeCodeGen(ctx, ev, (str, pattern) =>
    +    nullSafeCodeGen(ctx, ev, (str, regex, limit) => {
           // Array in java is covariant, so we don't need to cast UTF8String[] to Object[].
    -      s"""${ev.value} = new $arrayClass($str.split($pattern, -1));""")
    +      s"""${ev.value} = new $arrayClass($str.split(
    +         $regex,${handleCodeGenLimitFallback(limit)}));""".stripMargin
    +    })
       }
     
       override def prettyName: String = "split"
    +
    +  /**
    +   * Java String's split method supports "ignore empty string" behavior when the limit is 0.
    +   * To avoid this, we fall back to -1 when the limit is 0. Otherwise, this is a noop.
    +   */
    +  def maybeFallbackLimitValue(limit: Int): Int = {
    --- End diff --
    
    done!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    @HyukjinKwon all tests now passing! Let me know if there is anything else to do for this PR. Thanks


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214165774
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,58 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around matches of the given regex.
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * regex - a string representing a regular expression. The regex string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the regex is applied.
    +
    +        limit > 0: The resulting array's length will not be more than `limit`, and the resulting
    +                   array's last entry will contain all input beyond the last matched regex.
    +
    +        limit < 0: `regex` will be applied as many times as possible, and the resulting
    +                   array can be of any size.
    +
    +        limit = 0: `regex` will be applied as many times as possible, the resulting array can
    --- End diff --
    
    I see this as a value-add in that the user just gets more ammunition to answer a given question about his/her data. Sure, it's not necessary to have the `limit = 0` case, but if a user can get what they need by writing fewer lines of code because it exists I'd say it's definitely worth exposing.
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    @maropu @HyukjinKwon let me know what you think, took care of your comments


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213540571
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,59 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around pattern (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`" +
    +    " and returns an array of at most `limit`",
    +  arguments = """
    +    Arguments:
    +      * str - a string expression to split.
    +      * pattern - a string representing a regular expression. The pattern string should be a
    +        Java regular expression.
    +      * limit - an integer expression which controls the number of times the pattern is applied.
    +
    +        limit > 0:
    +          The resulting array's length will not be more than `limit`, and the resulting array's
    +          last entry will contain all input beyond the last matched pattern.
    +
    +        limit < 0:
    +          `pattern` will be applied as many times as possible, and the resulting
    +          array can be of any size.
    +
    +        limit = 0:
    +          `pattern` will be applied as many times as possible, the resulting array can
    +          be of any size, and trailing empty strings will be discarded.
    +  """,
    --- End diff --
    
    How about this formatting?;
    ```
    
     function_desc | Extended Usage:
        Arguments:
          * str - a string expression to split.
          * pattern - a string representing a regular expression. The pattern string should be a
            Java regular expression.
          * limit - an integer expression which controls the number of times the pattern is applied.
    
            limit > 0: The resulting array's length will not be more than `limit`, and the resulting array's
                       last entry will contain all input beyond the last matched pattern.
            limit < 0: `pattern` will be applied as many times as possible, and the resulting
                       array can be of any size.
            limit = 0: `pattern` will be applied as many times as possible, the resulting array can
                       be of any size, and trailing empty strings will be discarded.
      
        Examples:
          > SELECT split('oneAtwoBthreeC', '[ABC]');
           ["one","two","three",""]
          > SELECT split('oneAtwoBthreeC', '[ABC]', 0);
           ["one","two","three"]
          > SELECT split('oneAtwoBthreeC', '[ABC]', 2);
           ["one","twoBthreeC"]
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214357874
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -1669,20 +1669,36 @@ def repeat(col, n):
         return Column(sc._jvm.functions.repeat(_to_java_column(col), n))
     
     
    -@since(1.5)
    +@since(2.4)
     @ignore_unicode_prefix
    -def split(str, pattern):
    -    """
    -    Splits str around pattern (pattern is a regular expression).
    -
    -    .. note:: pattern is a string represent the regular expression.
    -
    -    >>> df = spark.createDataFrame([('ab12cd',)], ['s',])
    -    >>> df.select(split(df.s, '[0-9]+').alias('s')).collect()
    -    [Row(s=[u'ab', u'cd'])]
    -    """
    -    sc = SparkContext._active_spark_context
    -    return Column(sc._jvm.functions.split(_to_java_column(str), pattern))
    +def split(str, regex, limit=-1):
    --- End diff --
    
    I'll change back to `pattern` here. And just curious, how is this an API break if it's an optional parameter? @felixcheung 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562429
  
    --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java ---
    @@ -394,12 +394,14 @@ public void substringSQL() {
     
       @Test
       public void split() {
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), -1),
    -      new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    -    assertTrue(Arrays.equals(fromString("ab,def,ghi").split(fromString(","), 2),
    -      new UTF8String[]{fromString("ab"), fromString("def,ghi")}));
    +    UTF8String[] negativeAndZeroLimitCase =
    +            new UTF8String[]{fromString("ab"), fromString("def"), fromString("ghi"), fromString("")};
    +    assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), 0),
    +            negativeAndZeroLimitCase));
    +    assertTrue(Arrays.equals(fromString("ab,def,ghi,").split(fromString(","), -1),
    --- End diff --
    
    Why should we change the existing tests? Just add one test to check 
    
    ```
        if (limit == 0) {
          limit = -1;
        }
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213844148
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -229,33 +229,59 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
     
     
     /**
    - * Splits str around pat (pattern is a regular expression).
    + * Splits str around pattern (pattern is a regular expression).
    --- End diff --
    
    going to switch to `regex`, makes more sense given that with the use of `pattern` we always have to define it as a regex


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r214562388
  
    --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
    @@ -1803,6 +1803,18 @@ test_that("string operators", {
         collect(select(df4, split_string(df4$a, "\\\\")))[1, 1],
         list(list("a.b@c.d   1", "b"))
       )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
    +    list(list("a", "b@c.d   1\\b"))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", -2)))[1, 1],
    +    list(list("a.", "@c.d   1\\", ""))
    +  )
    +  expect_equal(
    +    collect(select(df4, split_string(df4$a, "b", 0)))[1, 1],
    --- End diff --
    
    I wouldn't add all of those cases for R. One test case to check if they can be called should be good enough.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    **[Test build #95388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95388/testReport)** for PR 22227 at commit [`e8c8c8c`](https://github.com/apache/spark/commit/e8c8c8c5e3811fd8e8f2c7b8c77ae404b4acc157).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213176423
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,49 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied. If the limit " +
    +    "n is greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "less than 0, then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
    --- End diff --
    
    +1 for https://github.com/apache/spark/pull/22227#discussion_r212815685. The doc should better be concise.
    
    Can we just move those `limit` specific description into the arguments at `limit - a..`? This looks a bit messy.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

Posted by phegstrom <gi...@git.apache.org>.
Github user phegstrom commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22227#discussion_r213317001
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -232,30 +232,49 @@ case class RLike(left: Expression, right: Expression) extends StringRegexExpress
      * Splits str around pat (pattern is a regular expression).
      */
     @ExpressionDescription(
    -  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
    +  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
    +    "The `limit` parameter controls the number of times the pattern is applied. If the limit " +
    +    "n is greater than zero then the pattern will be applied at most n - 1 times, " +
    +    "the array's length will be no greater than n, and the array's last entry " +
    +    "will contain all input beyond the last matched delimiter. If n is " +
    +    "less than 0, then the pattern will be applied as many times as " +
    +    "possible and the array can have any length. If n is zero then the " +
    +    "pattern will be applied as many times as possible, the array can " +
    +    "have any length, and trailing empty strings will be discarded.",
    --- End diff --
    
    will do!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22227: [SPARK-25202] [Core] Implements split with limit sql fun...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22227
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org