You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by watermen <gi...@git.apache.org> on 2015/09/11 11:30:25 UTC

[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

GitHub user watermen opened a pull request:

    https://github.com/apache/spark/pull/8715

    [SPARK-9213] [SQL] [WIP] Improve regular expression performance (via joni)

    The joni regex engine (https://github.com/jruby/joni), a Java port of Oniguruma regexp library done by the JRuby project, is:
    * MIT licensed
    * Designed to work with byte[] arguments instead of String
    * Capable of handling UTF8 encoding
    * Regex syntax compatible
    * Interruptible
    * About twice as fast as j.u.regex
    * Has JRuby's jcodings library as a dependency, also MIT licensed
    
    ToDoList:
    - [ ] Fix the bug in `StringUtils.escapeLikeRegex` and pass all of the testes with Like
    - [ ] list syntax required (any unordered or ordered list supported)
    - [ ] Add joni as regular expression engine in `StringSplit`
    
    /cc @rxin Can you review the code for me?
    @scwf

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/watermen/spark SPARK-9213

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8715.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8715
    
----
commit 30c183470d549aeec7089eb2317c0569e869941d
Author: Yadong Qi <qi...@gmail.com>
Date:   2015-09-11T09:17:53Z

    Add joni as the default regular expression engine.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by watermen <gi...@git.apache.org>.
Github user watermen closed the pull request at:

    https://github.com/apache/spark/pull/8715


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139499523
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139500714
  
      [Test build #42323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42323/consoleFull) for   PR 8715 at commit [`30c1834`](https://github.com/apache/spark/commit/30c183470d549aeec7089eb2317c0569e869941d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139499551
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139530604
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by watermen <gi...@git.apache.org>.
Github user watermen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8715#discussion_r39255692
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala ---
    @@ -68,24 +69,32 @@ trait StringRegexExpression extends ImplicitCastInputTypes {
     case class Like(left: Expression, right: Expression)
       extends BinaryExpression with StringRegexExpression with CodegenFallback {
     
    -  override def escape(v: String): String = StringUtils.escapeLikeRegex(v)
    +  override def escape(v: Array[Byte]): Array[Byte] = StringUtils.escapeLikeRegex(v)
     
    -  override def matches(regex: Pattern, str: String): Boolean = regex.matcher(str).matches()
    +  override def matches(regex: Regex, input: Array[Byte]): Boolean =
    +    regex.matcher(input).`match`(0, input.length, Option.DEFAULT) > -1
     
       override def toString: String = s"$left LIKE $right"
     
       override protected def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
    -    val patternClass = classOf[Pattern].getName
    +    val regexClass = classOf[Regex].getName
    +    val optionClass = classOf[Option].getName
    +    val encodingClass = classOf[UTF8Encoding].getName
         val escapeFunc = StringUtils.getClass.getName.stripSuffix("$") + ".escapeLikeRegex"
    -    val pattern = ctx.freshName("pattern")
    +    val regex = ctx.freshName("regex")
     
         if (right.foldable) {
           val rVal = right.eval()
           if (rVal != null) {
    -        val regexStr =
    -          StringEscapeUtils.escapeJava(escape(rVal.asInstanceOf[UTF8String].toString()))
    -        ctx.addMutableState(patternClass, pattern,
    -          s"""$pattern = ${patternClass}.compile("$regexStr");""")
    +        val tmp =
    --- End diff --
    
    @rxin How can I pass `byte[]` or `UTF8String` into `s""" """`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-141372710
  
    Can you pass the string in as a variable of the current expression? I think the code gen thing can access the current expression object.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139530607
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42323/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139604369
  
    It would be great to do some performance benchmarks too.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139530472
  
      [Test build #42323 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42323/console) for   PR 8715 at commit [`30c1834`](https://github.com/apache/spark/commit/30c183470d549aeec7089eb2317c0569e869941d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait StringRegexExpressionJavaFallback extends ImplicitCastInputTypes `
      * `case class LikeJavaFallback(left: Expression, right: Expression)`
      * `case class RLikeJavaFallback(left: Expression, right: Expression)`
      * `case class StringSplitJavaFallback(str: Expression, pattern: Expression)`
      * `case class RegExpReplaceJavaFallback(subject: Expression, regexp: Expression, rep: Expression)`
      * `case class RegExpExtractJavaFallback(subject: Expression, regexp: Expression, idx: Expression)`
      * `    case class PairInt(begin: Int, end: Int)`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-165028653
  
    @watermen  can you close this first? We can reopen when it is ready.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9213] [SQL] [WIP] Improve regular expre...

Posted by watermen <gi...@git.apache.org>.
Github user watermen commented on the pull request:

    https://github.com/apache/spark/pull/8715#issuecomment-139775417
  
    @rxin I'll do some performance benchmarks later, and do you have any idea about passing `byte[]` or `UTF8String` into `s""" """`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org