Posted to issues@spark.apache.org by "Max Moroz (JIRA)" <ji...@apache.org> on 2016/06/30 07:53:10 UTC

[jira] [Created] (SPARK-16324) regexp_extract returns empty string when match fails

Max Moroz created SPARK-16324:
---------------------------------

             Summary: regexp_extract returns empty string when match fails
                 Key: SPARK-16324
                 URL: https://issues.apache.org/jira/browse/SPARK-16324
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Max Moroz
            Priority: Minor


The documentation for regexp_extract isn't clear about how it should behave if the regex doesn't match the row. However, the Java documentation it refers to for further detail suggests that the return value should be null if the group wasn't matched at all, an empty string if the group matched an empty string, and that an exception should be raised if the entire regex didn't match.

This would match how Python's own re module behaves when MatchObject.group() is called.
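
For reference, here is a minimal sketch of those three cases using Python's re module. The pattern and input strings are made up for illustration and are not taken from Spark or its documentation.

```python
import re

# Hypothetical pattern: group 1 is optional, group 2 may match the empty string.
pattern = re.compile(r'(a)?(b*)c')

m = pattern.search('c')
unmatched_group = m.group(1)  # group 1 did not participate in the match
empty_group = m.group(2)      # group 2 matched, but matched the empty string

no_match = pattern.search('xyz')  # the regex as a whole fails to match

print(repr(unmatched_group), repr(empty_group), repr(no_match))
```

Note that in Python a failed match yields None instead of raising directly; calling .group() on that None raises AttributeError, so the failure still cannot pass silently as an empty string.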

However, in practice regexp_extract() returns an empty string when the match fails. This seems to be a bug. If it was intended as a feature, it should have been documented as such, and it was probably not a good idea, since a failed match becomes indistinguishable from a group that matched an empty string and can result in silent bugs.

{code}
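import pyspark.sql.functions as F
# `spark` is the SparkSession that the pyspark shell provides
df = spark.createDataFrame([['abc']], ['text'])
# The pattern does not match 'abc', yet the result is '' rather than null or an error
assert df.select(F.regexp_extract('text', r'(z)', 1)).first()[0] == ''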
{code}


