Posted to issues@spark.apache.org by "Jeff Evans (Jira)" <ji...@apache.org> on 2020/01/23 19:54:00 UTC

[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

    [ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022430#comment-17022430 ] 

Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM:
-------------------------------------------------------------

After some debugging, I figured out what's going on here.  The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399.  The string-literal parsing behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).  If you start the PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, the backslash itself has to be escaped so that it survives SQL string-literal parsing, meaning the pattern would need to be {{'( |\\\\.)*'}}.
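
For reference, here is a minimal PySpark sketch of both workarounds (assuming a 2.x+ {{SparkSession}}; the session and variable names are illustrative, not taken from the reporter's code):

{noformat}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('..   5.    ',)], ['col'])

# Workaround 1: restore the pre-2.0 (1.6-style) parsing of string literals.
spark.conf.set("spark.sql.parser.escapedStringLiterals", True)
print(df.selectExpr("regexp_replace(col, '( |\\.)*', '') AS col").collect())
# -> [Row(col='5')]

# Workaround 2: keep the default parser and double the backslash instead,
# so that the dot is still escaped after SQL string-literal parsing.
spark.conf.set("spark.sql.parser.escapedStringLiterals", False)
print(df.selectExpr("regexp_replace(col, '( |\\\\.)*', '') AS col").collect())
# -> [Row(col='5')]
{noformat}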



> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.4.3
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: correctness
>
> We found an error in Spark 2.0.2's execution of regexes. Using PySpark in 1.6.2, we get the following expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the Spark version. We checked both regexes in Java, and both are correct and work as expected. Therefore, regex execution in 2.0.2 seems to be erroneous. I am not able to confirm this on 2.1 at the moment.
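
For what it's worth, the reporter's Java check can be reproduced with plain Python's {{re}} module as well. This sketch (an illustration added here, not part of the original report) shows that the regex itself is fine, and that the empty result only appears once SQL string-literal parsing has stripped the backslash:

{noformat}
import re

s = '..   5.    '
print(re.sub(r'[ \.]*', '', s))   # '5'
print(re.sub(r'( |\.)*', '', s))  # '5'
# After the backslash is stripped by SQL string-literal parsing, the pattern
# Spark actually runs is '( |.)*', which matches everything:
print(re.sub(r'( |.)*', '', s))   # ''
{noformat}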


