Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2019/05/21 06:20:00 UTC

[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

    [ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844547#comment-16844547 ] 

Nicholas Chammas commented on SPARK-19248:
------------------------------------------

Looks like Spark 2.4.3 still exhibits the behavior reported in the original issue: 
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 13:25:00)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame([('..   5.    ',)], ['col'])
>>> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
>>> dfout                                                                       
[Row(col='5')]
>>> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
>>> dfout2
[Row(col='')]
>>> 
{code}
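For reference, here is a quick standalone check against plain {{java.util.regex}} (a minimal sketch, assuming Java 8 semantics; {{RegexCheck}} is just an illustrative class name). It suggests both patterns should leave the {{5}} intact, because {{replaceAll}} keeps any character that only a zero-width match touches:
{code:java}
public class RegexCheck {
    public static void main(String[] args) {
        String input = "..   5.    ";
        // Character-class form: returns '5' on both Spark 1.6 and 2.x.
        System.out.println(input.replaceAll("[ \\.]*", ""));   // prints: 5
        // Grouped-alternation form: the one Spark 2.x reduces to ''.
        // At the '5', the pattern only matches zero-width; replaceAll
        // replaces the empty match and then keeps the '5' itself.
        System.out.println(input.replaceAll("( |\\.)*", ""));  // prints: 5
    }
}
{code}
If plain Java keeps the {{5}} for both patterns, the empty result from {{regexp_replace}} in 2.x looks like a Spark-side issue rather than a quirk of the underlying regex engine.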
[~hyukjin.kwon] - I'm going to reopen this issue.

> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: bulk-closed
>
> We found an error in Spark 2.0.2's regex execution. Using PySpark on 1.6.2, we get the following expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the Spark version. We checked both regexes in Java; both are valid and should produce the same result. Regex execution in 2.0.2 therefore appears to be erroneous. I am not currently able to confirm whether 2.1 is affected as well.


