Posted to issues@spark.apache.org by "Lucas Tittmann (JIRA)" <ji...@apache.org> on 2017/01/16 16:48:26 UTC

[jira] [Created] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

Lucas Tittmann created SPARK-19248:
--------------------------------------

             Summary: Regex_replace works in 1.6 but not in 2.0
                 Key: SPARK-19248
                 URL: https://issues.apache.org/jira/browse/SPARK-19248
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: Lucas Tittmann


We found an error in Spark 2.0.2's regex execution. Using PySpark in 1.6.2, we get the following expected behaviour:
{noformat}
df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
z.show(dfout)
>>> [Row(col=u'5')]
dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
z.show(dfout2)
>>> [Row(col=u'5')]
{noformat}
In Spark 2.0.2, with the same code, we get the following:
{noformat}
df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
z.show(dfout)
>>> [Row(col=u'5')]
dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
z.show(dfout2)
{color:red}
>>> [Row(col=u'')]
{color}
{noformat}

As you can see, the second regex behaves differently depending on the Spark version. We checked both regexes in Java, and both are correct and should work. Therefore, regex execution in 2.0.2 seems to be erroneous. I am currently unable to confirm whether 2.1 is affected as well.
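For reference, both patterns can be checked outside Spark with Python's {{re}} module (a standalone sketch, not Spark itself), and both strip only spaces and literal dots there. One unconfirmed possibility is that the 2.0.x SQL parser consumes backslash escapes in string literals before the pattern reaches the regex engine, turning {{\.}} into a bare {{.}}; the last line below is a hypothetical reproduction of the 2.0.2 output under that assumption:
{noformat}
import re

s = '..   5.    '

# Both patterns from the report, checked against a plain regex engine:
# each strips only spaces and literal dots, leaving the digit.
assert re.sub(r'[ \.]*', '', s) == '5'
assert re.sub(r'( |\.)*', '', s) == '5'

# Hypothetical: if the SQL parser consumed the backslash so that '\.'
# reached the regex engine as a bare '.', the pattern would match any
# character and the whole string (including the '5') would be removed,
# matching the Spark 2.0.2 output above.
assert re.sub(r'( |.)*', '', s) == ''
{noformat}
Note that {{[ \.]*}} is unaffected by such escape stripping, because inside a character class a bare {{.}} is still a literal dot, which would explain why only the second query changed behaviour.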



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
