Posted to dev@spark.apache.org by Khalid Mammadov <kh...@gmail.com> on 2022/10/02 16:19:17 UTC

Missing string replace function

Hi,

As you know, there's no string "replace" function in
pyspark.sql.functions for PySpark, nor in org.apache.spark.sql.functions for
Scala/Java, and I was wondering why that is. I know there are alternatives:
regexp_replace, na.replace, or SQL via expr.

I think it's one of the fundamental functions in a developer's toolset
and is available in almost every language. It takes time for new Spark devs to
realise it's not there and to find the alternatives. So I think it would be
nice to have one.
I already have a prototype for Scala (which is just sugar over
regexp_replace) and it works like a charm :)

I'd like to know your opinion on whether to contribute it, or whether it's not needed.

Thanks
Khalid
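For context, the "sugar over regexp_replace" idea above can be sketched outside Spark with plain Python's re module. The key step is escaping the search string so it is matched literally; the literal_replace name and the code are a hypothetical illustration, not the actual prototype:

```python
import re

def literal_replace(text: str, search: str, replacement: str) -> str:
    """Literal string replace built on top of a regex engine.

    re.escape() neutralises any regex metacharacters in `search`, which is
    the same trick a replace() wrapper over regexp_replace would need.
    """
    return re.sub(re.escape(search), replacement, text)

print(literal_replace("aaa zzz", "aaa", "bbb"))  # bbb zzz
print(literal_replace("a.c", ".", "X"))          # aXc (the dot stays literal)
```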

Re: EXT: Re: Missing string replace function

Posted by Vibhor Gupta <Vi...@walmart.com.INVALID>.
Hi Khalid,

See https://issues.apache.org/jira/browse/SPARK-31628.

It might just be syntactic sugar over the StringReplace class (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L662), but it makes things a little easier and neater.

There are a lot of such missing APIs in Scala and Python.

Regards,
Vibhor


________________________________
From: russell.spitzer@gmail.com <ru...@gmail.com>
Sent: Monday, October 3, 2022 12:31 AM
To: Khalid Mammadov <kh...@gmail.com>
Cc: dev <de...@spark.apache.org>
Subject: EXT: Re: Missing string replace function


Ah, for that I think it makes sense to add a function, but it probably should not be an alias for regexp_replace, since that has very different semantics for certain string arguments.
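The semantic difference referred to here shows up in any regex engine; a quick plain-Python illustration (standing in for regexp_replace versus a true literal replace):

```python
import re

s = "a.b.c"

# Literal replace: only the actual dot characters change.
print(s.replace(".", "-"))    # a-b-c

# Regex replace: "." is a metacharacter matching any character,
# so every character in the string gets replaced.
print(re.sub(".", "-", s))    # -----
```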

Sent from my iPhone

On Oct 2, 2022, at 1:31 PM, Khalid Mammadov <kh...@gmail.com> wrote:


Thanks Russell for checking this out!

This is a good example of a replace that is available in Spark SQL but, unfortunately, not in the PySpark or Scala APIs.
The alternative mentioned is regexp_replace, but as developers looking for a replace function we tend to ignore the regex version, since it's not what we usually look for, and only later realise there is no built-in replace utility function and that we have to use the regexp alternative.

So, to give an example, it is possible now to do something like this:
scala> val df = Seq("aaa zzz").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.select(expr("replace(value, 'aaa', 'bbb')")).show()
+------------------------+
|replace(value, aaa, bbb)|
+------------------------+
|                 bbb zzz|
+------------------------+

But not this:
df.select(replace('value, "aaa", "ooo")).show()
as a replace function is not available in the functions module of either PySpark or Scala.

And this is the output from my local prototype, which would be good to see in the official API:
scala> df.select(replace('value, "aaa", "ooo")).show()
+----------------------------------+
|regexp_replace(value, aaa, ooo, 1)|
+----------------------------------+
|                           ooo zzz|
+----------------------------------+

WDYT?
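One caveat with a sugar-over-regexp_replace prototype like the one shown: if the search string is forwarded to the regex engine unescaped, metacharacters in it silently change the result. A plain-Python sketch of the pitfall and the fix (illustrative only; the actual prototype code is not shown in this thread):

```python
import re

data = "abc axc a.c"

# Unescaped: "a.c" is treated as a pattern, so '.' matches any character
# and all three words are replaced.
print(re.sub("a.c", "X", data))             # X X X

# Escaped: "a.c" is matched literally, as a true replace() would do.
print(re.sub(re.escape("a.c"), "X", data))  # abc axc X
```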


On Sun, Oct 2, 2022 at 6:24 PM Russell Spitzer <ru...@gmail.com> wrote:
Quick test on 3.2 confirms everything should be working as expected

scala> spark.createDataset(Seq(("foo", "bar")))
res0: org.apache.spark.sql.Dataset[(String, String)] = [_1: string, _2: string]

scala> spark.createDataset(Seq(("foo", "bar"))).createTempView("temp")

scala> spark.sql("SELECT replace(_1, 'fo', 'bo') from temp").show
+-------------------+
|replace(_1, fo, bo)|
+-------------------+
|                boo|
+-------------------+

On Oct 2, 2022, at 12:21 PM, Russell Spitzer <ru...@gmail.com> wrote:

https://spark.apache.org/docs/3.3.0/api/sql/index.html#replace

This was added in Spark 2.3.0 as far as I can tell.

https://github.com/apache/spark/pull/18047

On Oct 2, 2022, at 11:19 AM, Khalid Mammadov <kh...@gmail.com> wrote:

Hi,

As you know, there's no string "replace" function in pyspark.sql.functions for PySpark, nor in org.apache.spark.sql.functions for Scala/Java, and I was wondering why that is. I know there are alternatives: regexp_replace, na.replace, or SQL via expr.

I think it's one of the fundamental functions in a developer's toolset and is available in almost every language. It takes time for new Spark devs to realise it's not there and to find the alternatives. So I think it would be nice to have one.
I already have a prototype for Scala (which is just sugar over regexp_replace) and it works like a charm :)

I'd like to know your opinion on whether to contribute it, or whether it's not needed.

Thanks
Khalid




Re: Missing string replace function

Posted by ru...@gmail.com.
Ah, for that I think it makes sense to add a function, but it probably should not be an alias for regexp_replace, since that has very different semantics for certain string arguments.

Sent from my iPhone

> On Oct 2, 2022, at 1:31 PM, Khalid Mammadov <kh...@gmail.com> wrote:
>
> Thanks Russell for checking this out!
>
> This is a good example of a *replace* that is available in Spark SQL but, unfortunately, not in the PySpark or Scala APIs.
> The alternative mentioned is regexp_replace, but as developers looking for a *replace* function we tend to ignore the regex version, since it's not what we usually look for, and only later realise there is no built-in replace utility function and that we have to use the regexp alternative.
>
> So, to give an example, it is possible now to do something like this:
>
> scala> val df = Seq("aaa zzz").toDF
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.select(expr("replace(value, 'aaa', 'bbb')")).show()
> +------------------------+
> |replace(value, aaa, bbb)|
> +------------------------+
> |                 bbb zzz|
> +------------------------+
>
> But not this:
>
> df.select(replace('value, "aaa", "ooo")).show()
>
> as a *replace* function is not available in the functions module of either PySpark or Scala.
>
> And this is the output from my local prototype, which would be good to see in the official API:
>
> scala> df.select(replace('value, "aaa", "ooo")).show()
> +----------------------------------+
> |regexp_replace(value, aaa, ooo, 1)|
> +----------------------------------+
> |                           ooo zzz|
> +----------------------------------+
>
> WDYT?
>
> On Sun, Oct 2, 2022 at 6:24 PM Russell Spitzer <russell.spitzer@gmail.com> wrote:
>
>> Quick test on 3.2 confirms everything should be working as expected
>>
>> scala> spark.createDataset(Seq(("foo", "bar")))
>> res0: org.apache.spark.sql.Dataset[(String, String)] = [_1: string, _2: string]
>>
>> scala> spark.createDataset(Seq(("foo", "bar"))).createTempView("temp")
>>
>> scala> spark.sql("SELECT replace(_1, 'fo', 'bo') from temp").show
>> +-------------------+
>> |replace(_1, fo, bo)|
>> +-------------------+
>> |                boo|
>> +-------------------+
>>
>>> On Oct 2, 2022, at 12:21 PM, Russell Spitzer <russell.spitzer@gmail.com> wrote:
>>>
>>> https://spark.apache.org/docs/3.3.0/api/sql/index.html#replace
>>>
>>> This was added in Spark 2.3.0 as far as I can tell.
>>>
>>> https://github.com/apache/spark/pull/18047
>>>
>>>> On Oct 2, 2022, at 11:19 AM, Khalid Mammadov <khalidmammadov9@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> As you know, there's no string "replace" function in pyspark.sql.functions for PySpark, nor in org.apache.spark.sql.functions for Scala/Java, and I was wondering why that is. I know there are alternatives: regexp_replace, na.replace, or SQL via expr.
>>>>
>>>> I think it's one of the fundamental functions in a developer's toolset and is available in almost every language. It takes time for new Spark devs to realise it's not there and to find the alternatives. So I think it would be nice to have one.
>>>> I already have a prototype for Scala (which is just sugar over regexp_replace) and it works like a charm :)
>>>>
>>>> I'd like to know your opinion on whether to contribute it, or whether it's not needed.
>>>>
>>>> Thanks
>>>> Khalid

Re: Missing string replace function

Posted by Khalid Mammadov <kh...@gmail.com>.
Thanks Russell for checking this out!

This is a good example of a *replace* that is available in Spark SQL
but, unfortunately, not in the PySpark or Scala APIs.
The alternative mentioned is regexp_replace, but as developers looking for a
*replace* function we tend to ignore the regex version, since it's not what we
usually look for, and only later realise there is no built-in replace utility
function and that we have to use the regexp alternative.

So, to give an example, it is possible now to do something like this:

> scala> val df = Seq("aaa zzz").toDF
> df: org.apache.spark.sql.DataFrame = [value: string]
> scala> df.select(expr("replace(value, 'aaa', 'bbb')")).show()
>
> +------------------------+
> |replace(value, aaa, bbb)|
> +------------------------+
> |                 bbb zzz|
> +------------------------+
>

But not this:

> df.select(replace('value, "aaa", "ooo")).show()
>
as a *replace* function is not available in the functions module of either
PySpark or Scala.

And this is the output from my local prototype, which would be good to see
in the official API:

> scala> df.select(replace('value, "aaa", "ooo")).show()
> +----------------------------------+
> |regexp_replace(value, aaa, ooo, 1)|
> +----------------------------------+
> |                           ooo zzz|
> +----------------------------------+
>

WDYT?


On Sun, Oct 2, 2022 at 6:24 PM Russell Spitzer <ru...@gmail.com>
wrote:

> Quick test on 3.2 confirms everything should be working as expected
>
> scala> spark.createDataset(Seq(("foo", "bar")))
> res0: org.apache.spark.sql.Dataset[(String, String)] = [_1: string, _2:
> string]
>
> scala> spark.createDataset(Seq(("foo", "bar"))).createTempView("temp")
>
> scala> spark.sql("SELECT replace(_1, 'fo', 'bo') from temp").show
> +-------------------+
> |replace(_1, fo, bo)|
> +-------------------+
> |                boo|
> +-------------------+
>
> On Oct 2, 2022, at 12:21 PM, Russell Spitzer <ru...@gmail.com>
> wrote:
>
> https://spark.apache.org/docs/3.3.0/api/sql/index.html#replace
>
> This was added in Spark 2.3.0 as far as I can tell.
>
> https://github.com/apache/spark/pull/18047
>
> On Oct 2, 2022, at 11:19 AM, Khalid Mammadov <kh...@gmail.com>
> wrote:
>
> Hi,
>
> As you know, there's no string "replace" function in
> pyspark.sql.functions for PySpark, nor in org.apache.spark.sql.functions for
> Scala/Java, and I was wondering why that is. I know there are alternatives:
> regexp_replace, na.replace, or SQL via expr.
>
> I think it's one of the fundamental functions in a developer's toolset
> and is available in almost every language. It takes time for new Spark devs to
> realise it's not there and to find the alternatives. So I think it would be
> nice to have one.
> I already have a prototype for Scala (which is just sugar over
> regexp_replace) and it works like a charm :)
>
> I'd like to know your opinion on whether to contribute it, or whether it's not needed.
>
> Thanks
> Khalid
>
>
>
>

Re: Missing string replace function

Posted by Russell Spitzer <ru...@gmail.com>.
Quick test on 3.2 confirms everything should be working as expected

scala> spark.createDataset(Seq(("foo", "bar")))
res0: org.apache.spark.sql.Dataset[(String, String)] = [_1: string, _2: string]

scala> spark.createDataset(Seq(("foo", "bar"))).createTempView("temp")

scala> spark.sql("SELECT replace(_1, 'fo', 'bo') from temp").show
+-------------------+
|replace(_1, fo, bo)|
+-------------------+
|                boo|
+-------------------+

> On Oct 2, 2022, at 12:21 PM, Russell Spitzer <ru...@gmail.com> wrote:
> 
> https://spark.apache.org/docs/3.3.0/api/sql/index.html#replace
> 
> This was added in Spark 2.3.0 as far as I can tell.
> 
> https://github.com/apache/spark/pull/18047
> 
>> On Oct 2, 2022, at 11:19 AM, Khalid Mammadov <khalidmammadov9@gmail.com> wrote:
>> 
>> Hi,
>> 
>> As you know, there's no string "replace" function in pyspark.sql.functions for PySpark, nor in org.apache.spark.sql.functions for Scala/Java, and I was wondering why that is. I know there are alternatives: regexp_replace, na.replace, or SQL via expr.
>> 
>> I think it's one of the fundamental functions in a developer's toolset and is available in almost every language. It takes time for new Spark devs to realise it's not there and to find the alternatives. So I think it would be nice to have one.
>> I already have a prototype for Scala (which is just sugar over regexp_replace) and it works like a charm :)
>> 
>> I'd like to know your opinion on whether to contribute it, or whether it's not needed.
>> 
>> Thanks
>> Khalid
>> 
> 


Re: Missing string replace function

Posted by Russell Spitzer <ru...@gmail.com>.
https://spark.apache.org/docs/3.3.0/api/sql/index.html#replace

This was added in Spark 2.3.0 as far as I can tell.

https://github.com/apache/spark/pull/18047

> On Oct 2, 2022, at 11:19 AM, Khalid Mammadov <kh...@gmail.com> wrote:
> 
> Hi,
> 
> As you know, there's no string "replace" function in pyspark.sql.functions for PySpark, nor in org.apache.spark.sql.functions for Scala/Java, and I was wondering why that is. I know there are alternatives: regexp_replace, na.replace, or SQL via expr.
> 
> I think it's one of the fundamental functions in a developer's toolset and is available in almost every language. It takes time for new Spark devs to realise it's not there and to find the alternatives. So I think it would be nice to have one.
> I already have a prototype for Scala (which is just sugar over regexp_replace) and it works like a charm :)
> 
> I'd like to know your opinion on whether to contribute it, or whether it's not needed.
> 
> Thanks
> Khalid
>