Posted to issues@spark.apache.org by "Lucas Tittmann (JIRA)" <ji...@apache.org> on 2017/01/16 16:48:26 UTC
[jira] [Created] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0
Lucas Tittmann created SPARK-19248:
--------------------------------------
Summary: Regex_replace works in 1.6 but not in 2.0
Key: SPARK-19248
URL: https://issues.apache.org/jira/browse/SPARK-19248
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.2
Reporter: Lucas Tittmann
We found an error in Spark 2.0.2's regex execution. Using PySpark on 1.6.2, we get the following, expected behaviour:
{noformat}
df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
z.show(dfout)
>>> [Row(col=u'5')]
dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
z.show(dfout2)
>>> [Row(col=u'5')]
{noformat}
In Spark 2.0.2, with the same code, we get the following:
{noformat}
df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
z.show(dfout)
>>> [Row(col=u'5')]
dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
z.show(dfout2)
{color:red}
>>> [Row(col=u'')]
{color}
{noformat}
As you can see, the second regex behaves differently depending on the Spark version. We checked both regexes in Java, and both are correct and should work. Therefore, regex execution in 2.0.2 seems to be erroneous. I am currently unable to confirm whether 2.1 is also affected.
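As a sanity check outside Spark (not part of the original report), the equivalence of the two patterns can be demonstrated with Python's re module, used here only as a stand-in for Java's regex engine, since both patterns are valid and mean the same thing in either:

```python
import re

# The same input string as in the report above.
s = '.. 5. '

# Character-class form: matches any run of spaces and literal dots.
out1 = re.sub(r'[ \.]*', '', s)

# Alternation form: matches any run of (space | literal dot).
out2 = re.sub(r'( |\.)*', '', s)

# Both patterns strip the spaces and dots and leave only '5',
# matching the Spark 1.6.2 behaviour for both expressions.
print(out1)  # 5
print(out2)  # 5
```

If both patterns agree here and in plain Java, the divergence in Spark 2.0.2 must come from Spark's handling of the expression rather than from the regex engine itself.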
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)