Posted to issues@spark.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2016/07/01 17:48:11 UTC

[jira] [Commented] (SPARK-16324) regexp_extract returns empty string when match fails

    [ https://issues.apache.org/jira/browse/SPARK-16324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359361#comment-15359361 ] 

Jeff Zhang commented on SPARK-16324:
------------------------------------

I think this is by design: when the match fails, nullSafeEval explicitly returns UTF8String.EMPTY_UTF8 rather than null.
{code}
  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
    if (!p.equals(lastRegex)) {
      // regex value changed, recompile the cached pattern
      lastRegex = p.asInstanceOf[UTF8String].clone()
      pattern = Pattern.compile(lastRegex.toString)
    }
    val m = pattern.matcher(s.toString)
    if (m.find) {
      val mr: MatchResult = m.toMatchResult
      UTF8String.fromString(mr.group(r.asInstanceOf[Int]))
    } else {
      // no match: return an empty string, not null
      UTF8String.EMPTY_UTF8
    }
  }
{code}
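
Callers who want NULL instead can work around this at the query level. Below is a minimal PySpark sketch, assuming a SparkSession named spark as in the repro; regexp_extract_or_null is my own illustrative helper, not an existing API, and it also nulls out groups that legitimately matched an empty string:
{code}
import pyspark.sql.functions as F

def regexp_extract_or_null(col, pattern, idx):
    # Hypothetical helper: map the empty-string sentinel back to NULL.
    # Caveat: cannot distinguish "no match" from a genuinely empty match.
    extracted = F.regexp_extract(col, pattern, idx)
    return F.when(extracted == '', None).otherwise(extracted)

df = spark.createDataFrame([['abc']], ['text'])
df.select(regexp_extract_or_null('text', r'(z)', 1).alias('m')).show()
# +----+
# |   m|
# +----+
# |null|
# +----+
{code}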

> regexp_extract returns empty string when match fails
> ----------------------------------------------------
>
>                 Key: SPARK-16324
>                 URL: https://issues.apache.org/jira/browse/SPARK-16324
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> The documentation for regexp_extract isn't clear about how it should behave when the regex doesn't match a row. However, the Java documentation it refers to for further detail suggests that the return value should be null if the group wasn't matched at all, an empty string if the group actually matched an empty string, and an exception raised if the entire regex didn't match.
> This would be identical to how Python's own re module behaves when MatchObject.group() is called.
> In practice, however, regexp_extract() returns an empty string when the match fails. This seems to be a bug; if it was intended as a feature, it should have been documented as such - and it was probably not a good idea, since it can result in silent bugs.
> {code}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([['abc']], ['text'])
> assert df.select(F.regexp_extract('text', r'(z)', 1)).first()[0] == ''
> {code}
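
For comparison, a minimal sketch of the Python re behavior the description refers to (plain CPython, no Spark needed):
{code}
import re

# Group matched: returns the matched text.
assert re.search(r'(a)', 'abc').group(1) == 'a'

# Group matched an empty string: returns ''.
assert re.search(r'(z*)', 'abc').group(1) == ''

# Regex matched overall but this group did not participate: returns None.
assert re.search(r'(a)(z)?', 'abc').group(2) is None

# Entire regex did not match: re.search returns None, so .group() raises.
try:
    re.search(r'(z)', 'abc').group(1)
except AttributeError:
    pass  # 'NoneType' object has no attribute 'group'
{code}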


