You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Augustine Theodore Prince (Jira)" <ji...@apache.org> on 2022/07/15 21:37:00 UTC

[jira] [Updated] (SPARK-39796) Add a regexp_extract variant which returns an array of all the matched capture groups

     [ https://issues.apache.org/jira/browse/SPARK-39796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Augustine Theodore Prince updated SPARK-39796:
----------------------------------------------
    Description: 
 

regexp_extract only returns a single matched group. In a lot of cases we need to parse the entire string and get all the groups and for that we'll need to call it as many times as there are groups. The regexp_extract_all function doesn't solve this problem as it only works if all the groups have the same regex pattern.

 

_Example:_

I will provide an example and the current workaround that I use to solve this,

If I have the following dataframe and I would like to match the column 'a' with this pattern
{code:java}
"([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
|a|
|Hello, World, 1234|
|Good, bye, friend|

 

My expected output  is as follows:
|a|extracted_a|
|Hello, World, 1234|[Hello, 1234]|
|Good, bye, friend|[]|

 

However, to achieve this I have to take the following approach which seems very hackish.
1. Use regexp_replace to create a temporary string built using the extracted groups:
{code:java}
df.withColumn("extr" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, (\\d+)", "$1_$2")){code}
A side effect of regexp_replace is that if the regex fails to match the entire string is returned.

 
|a|extracted_a|
|Hello, World, 1234|Hello_1234|
|Good, bye, friend|Good, bye, friend|

2. So, to achieve the desired result, a check has to be done to prune the rows that did not match with the pattern :
{code:java}
df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , None).otherwise(F.col("extracted_a"))){code}
 
to get the following intermediate dataframe,
|a|extracted_a|
|Hello, World, 1234|Hello_1234|
|Good, bye, friend|null|

 

3. Before finally splitting the column 'extracted_a' based on underscores
{code:java}
df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code}
which results in the desired result:

 
 
|a|extracted_a
|
|Hello, World, 1234|[Hello, 1234]|
|Good, bye, friend|null|

 

  was:
 

regexp_extract only returns a single matched group. In a lot of cases we need to parse the entire string and get all the groups and for that we'll need to call it as many times as there are groups. The regexp_extract_all function doesn't solve this problem as it only works if all the groups have the same regex pattern.

 

_Example:_

I will provide an example and the current workaround that I use to solve this,

If I have the following dataframe and I would like to match the column 'a' with this pattern
{code:java}
"([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
|a|
|Hello, World, 1234|
|Good, bye, friend|

 

My expected output  is as follows:
|a|extracted_a|
|Hello, World, 1234|[Hello, 1234]|
|Good, bye, friend|[]|

 

However, to achieve this I have to take the following approach which seems very hackish.
1. Use regexp_replace to create a temporary string built using the extracted groups:
{code:java}
df.withColumn("extr" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, (\\d+)", "$1_$2")){code}
A side effect of regexp_replace is that if the regex fails to match the entire string is returned.

 
|a|extracted_a|
|Hello, World, 1234|Hello_1234|
|Good, bye, friend|Good, bye, friend|

2. So, to achieve the desired result, a check has to be done to prune the rows that did not match with the pattern :
{code:java}
df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , None).otherwise(F.col("extracted_a"))){code}
 
to get the following intermediate dataframe,
|a|extracted_a|
|Hello, World, 1234|Hello_1234|
|Good, bye, friend|null|

 

3. Before finally splitting the column 'extracted_a' based on underscores
{code:java}
df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code}
which results in the desired result:

 
|a|extracted_a|

|Hello, World, 1234|[Hello, 1234]|
|Good, bye, friend|null|

 


> Add a regexp_extract variant which returns an array of all the matched capture groups
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-39796
>                 URL: https://issues.apache.org/jira/browse/SPARK-39796
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: Augustine Theodore Prince
>            Priority: Minor
>              Labels: regexp_extract, regexp_extract_all, regexp_replace
>
>  
> regexp_extract only returns a single matched group. In a lot of cases we need to parse the entire string and get all the groups and for that we'll need to call it as many times as there are groups. The regexp_extract_all function doesn't solve this problem as it only works if all the groups have the same regex pattern.
>  
> _Example:_
> I will provide an example and the current workaround that I use to solve this,
> If I have the following dataframe and I would like to match the column 'a' with this pattern
> {code:java}
> "([A-Za-z]+), [A-Za-z]+, (\\d+)"{code}
> |a|
> |Hello, World, 1234|
> |Good, bye, friend|
>  
> My expected output  is as follows:
> |a|extracted_a|
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|[]|
>  
> However, to achieve this I have to take the following approach which seems very hackish.
> 1. Use regexp_replace to create a temporary string built using the extracted groups:
> {code:java}
> df.withColumn("extr" , F.regexp_replace("a", "([A-Za-z]+), [A-Za-z]+, (\\d+)", "$1_$2")){code}
> A side effect of regexp_replace is that if the regex fails to match the entire string is returned.
>  
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|Good, bye, friend|
> 2. So, to achieve the desired result, a check has to be done to prune the rows that did not match with the pattern :
> {code:java}
> df = df.withColumn("extracted_a" , F.when(F.col("extracted_a")==F.col("a") , None).otherwise(F.col("extracted_a"))){code}
>  
> to get the following intermediate dataframe,
> |a|extracted_a|
> |Hello, World, 1234|Hello_1234|
> |Good, bye, friend|null|
>  
> 3. Before finally splitting the column 'extracted_a' based on underscores
> {code:java}
> df = df.withColumn("extracted_a" , F.split("extracted_a" , "[_]")){code}
> which results in the desired result:
>  
>  
> |a|extracted_a
> |
> |Hello, World, 1234|[Hello, 1234]|
> |Good, bye, friend|null|
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org