Posted to issues@spark.apache.org by "Oscar Brück (Jira)" <ji...@apache.org> on 2020/06/14 20:14:00 UTC
[jira] [Updated] (SPARK-31991) The SparkR regexp_replace function causes problems
[ https://issues.apache.org/jira/browse/SPARK-31991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oscar Brück updated SPARK-31991:
--------------------------------
Description:
The SparkR regex functions are not working as they should. Here's a reprex.
{code:r}
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)

# Connect to Spark (the original reprex omitted this step)
sc <- spark_connect(master = "local")

# Create data
df <- data.frame(test = c("Less 2", "A1,2", "Over 2", "Resp1 1aa"))

# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)

# Modify data: each step writes to a new column
df1 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test1 = regexp_replace(test, "Less ", "<"),
    test1 = regexp_replace(test1, "A1", "<1"),
    test1 = regexp_replace(test1, "Over ", ">"),
    test2 = regexp_replace(test1, "[a-zA-Z]+", ""),
    test3 = regexp_replace(test2, "[\\,]", "aa"),
    test3 = regexp_replace(test3, " ", ""))

# Same transformations, but overwriting the same column each time
df2 <- df %>%
  dplyr::mutate(
    test = as.character(test),
    test = regexp_replace(test, "Less ", "<"),
    test = regexp_replace(test, "A1", "<1"),
    test = regexp_replace(test, "Over ", ">"),
    test = regexp_replace(test, "[a-zA-Z]+", ""),
    test = regexp_replace(test, "[\\,]", "aa"),
    test = regexp_replace(test, " ", ""))

# Collect and print
df1_1 <- df1 %>% as.data.frame()
df1_1
df2_1 <- df2 %>% as.data.frame()
df2_1
{code}
The column test3 in df1_1 is correct, but the column test in df2_1 is not, although the input, the regex patterns, and the replacements are identical.
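For reference, applying the same replacement chain step by step should give the same result either way. A minimal sketch of the expected output, using Python's re module as a stand-in (for these literal patterns and simple character classes the semantics match the java.util.regex engine that Spark's regexp_replace uses):

```python
import re

# The sample strings from the reprex
rows = ["Less 2", "A1,2", "Over 2", "Resp1 1aa"]

# The same (pattern, replacement) chain the reprex applies with regexp_replace
steps = [
    ("Less ", "<"),
    ("A1", "<1"),
    ("Over ", ">"),
    ("[a-zA-Z]+", ""),
    ("[\\,]", "aa"),
    (" ", ""),
]

def apply_chain(s: str) -> str:
    # Apply every replacement in order, feeding each result into the next step
    for pattern, replacement in steps:
        s = re.sub(pattern, replacement, s)
    return s

print([apply_chain(s) for s in rows])  # ['<2', '<1aa2', '>2', '11']
```

This is the output that both df1_1$test3 and df2_1$test should show; the discrepancy between the two is the bug being reported.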
I find SparkR really great, but I would be eager to use R regex syntax instead of Java regex syntax.
was:
The SparkR regex functions are not working as they should. Here's a reprex.
(The rest of the previous description is identical to the updated description above.)
Summary: The SparkR regexp_replace function causes problems (was: Regexp_replace causing problems)
> The SparkR regexp_replace function causes problems
> --------------------------------------------------
>
> Key: SPARK-31991
> URL: https://issues.apache.org/jira/browse/SPARK-31991
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.4.5
> Reporter: Oscar Brück
> Priority: Major
>
> (Issue description as above.)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org