Posted to issues@spark.apache.org by "Nick Nicolini (JIRA)" <ji...@apache.org> on 2018/07/22 03:22:00 UTC

[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

    [ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551885#comment-16551885 ] 

Nick Nicolini commented on SPARK-16203:
---------------------------------------

[~srowen] [~hvanhovell] I'd like to re-open this discussion. I've recently hit many regexp-parsing cases where we need to match a pattern that repeats an arbitrary number of times; for example, a text block that looks something like:

 
{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}
Here I need to pull out every value between "MSG:" and "|", and that pair can occur anywhere from 1 to n times per record. I cannot reliably use the method shown above, and while I can write a UDF to handle this, it would be great if it were supported natively in Spark.
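
For reference, the UDF workaround I have in mind is roughly the sketch below (the session setup, DataFrame, and column name are just illustrative):
{code:python}
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Pre-compiled pattern for the "MSG:<value>|" blocks shown above.
MSG_PATTERN = re.compile(r"MSG:(.*?)\|")

@udf(returnType=ArrayType(StringType()))
def extract_all_msgs(text):
    # Return every captured group, not just the first match.
    return MSG_PATTERN.findall(text) if text is not None else None

df = spark.createDataFrame(
    [("AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|",)], ["raw"]
)
df.select(extract_all_msgs("raw").alias("msgs")).show(truncate=False)
# msgs -> [ASDF, QWER, ZXCV]
{code}
It works, but it forces every row through a Python round trip for something the engine could do natively.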

Perhaps we can implement something like "regexp_extract_all", as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] already do?
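
To make the proposal concrete: Presto's signature is regexp_extract_all(string, pattern[, group]), and a Spark built-in with the same shape would let the UDF above collapse to a single expression. The spelling below is purely hypothetical, since no such built-in exists today:
{code:python}
# Hypothetical: assumes a built-in regexp_extract_all with Presto-like semantics.
df.selectExpr(r"regexp_extract_all(raw, 'MSG:(.*?)\\|', 1) AS msgs")
{code}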

> regexp_extract to return an ArrayType(StringType())
> ---------------------------------------------------
>
>                 Key: SPARK-16203
>                 URL: https://issues.apache.org/jira/browse/SPARK-16203
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case, e.g., web log parsing) we need to parse the entire line and get all the groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient. (How would Spark know not to do multiple pattern-matching operations when only one is needed? Or does the optimizer actually check whether the patterns are identical and, if they are, avoid the repeated regex matching operations?)
> Would it be possible to have it return an array when the index is not specified (defaulting to None)?
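
For reference, the repeated-call pattern described in the original report looks roughly like this in PySpark today (the logs DataFrame, column name, and pattern are illustrative):
{code:python}
from pyspark.sql.functions import regexp_extract

# The same pattern has to be restated once per capture group.
log_pattern = r'^(\S+) \S+ (\S+) \[([^\]]+)\]'
parsed = logs.select(
    regexp_extract("line", log_pattern, 1).alias("host"),
    regexp_extract("line", log_pattern, 2).alias("user"),
    regexp_extract("line", log_pattern, 3).alias("timestamp"),
)
{code}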



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org