Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2017/01/09 23:19:58 UTC
[jira] [Resolved] (SPARK-18866) Codegen fails with cryptic error if regexp_replace() output column is not aliased
[ https://issues.apache.org/jira/browse/SPARK-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-18866.
--------------------------------
Resolution: Duplicate
Fix Version/s: 2.1.1, 2.2.0
> Codegen fails with cryptic error if regexp_replace() output column is not aliased
> ---------------------------------------------------------------------------------
>
> Key: SPARK-18866
> URL: https://issues.apache.org/jira/browse/SPARK-18866
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.2, 2.1.0
> Environment: Java 8, Python 3.5
> Reporter: Nicholas Chammas
> Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Here's a minimal repro:
> {code}
> import pyspark
> from pyspark.sql import Column
> from pyspark.sql.functions import regexp_replace, lower, col
> def normalize_udf(column: Column) -> Column:
>     normalized_column = (
>         regexp_replace(
>             column,
>             pattern='[\s]+',
>             replacement=' ',
>         )
>     )
>     return normalized_column
> if __name__ == '__main__':
>     spark = pyspark.sql.SparkSession.builder.getOrCreate()
>     raw_df = spark.createDataFrame(
>         [(' ',)],
>         ['string'],
>     )
>     normalized_df = raw_df.select(normalize_udf('string'))
>     normalized_df_prime = (
>         normalized_df
>         .groupBy(sorted(normalized_df.columns))
>         .count())
>     normalized_df_prime.show()
> {code}
> When I run this I get:
> {code}
> ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 130: Invalid escape sequence
> {code}
> Followed by a huge barf of generated Java code, _and then the output I expect_. (So despite the scary error, the code actually works!)
> Can you spot the error in my code?
> It's simple: I just need to alias the output of {{normalize_udf()}} and all is forgiven:
> {code}
> normalized_df = raw_df.select(normalize_udf('string').alias('string'))
> {code}
> Of course, it's impossible to tell that from the current error output. So my *first question* is: Is there some way we can better communicate to the user what went wrong?
> Another interesting thing I noticed is that if I try this:
> {code}
> normalized_df = raw_df.select(lower('string'))
> {code}
> I immediately get a clean error saying:
> {code}
> py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.lower. Trace:
> py4j.Py4JException: Method lower([class java.lang.String]) does not exist
> {code}
> I can fix this by building a column object:
> {code}
> normalized_df = raw_df.select(lower(col('string')))
> {code}
> So that raises *a second problem/question*: Why does {{lower()}} require that I build a Column object, whereas {{regexp_replace()}} does not? The inconsistency adds to the confusion here.
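
One reading that fits the symptoms above, though nothing in this thread confirms it, is that the unaliased column gets an auto-generated name containing the regex pattern, and that name reaches the generated Java source with its backslash unescaped. The sketch below is plain Python with no Spark dependency and only illustrates that idea. The helper names and the example column name are hypothetical, and the escape set is a simplified version of what a Java 8 string literal accepts.

```python
# Characters that may follow a backslash in a Java 8 string literal.
# Simplification: 'u' is accepted without checking the hex digits that
# a real \uXXXX escape requires.
JAVA8_ESCAPE_CHARS = set('btnfr"\'\\01234567u')


def invalid_java_escapes(text):
    """Return each backslash sequence in `text` that Java 8 would reject
    if `text` were pasted unescaped between double quotes."""
    bad = []
    i = 0
    while i < len(text):
        if text[i] == '\\':
            nxt = text[i + 1] if i + 1 < len(text) else ''
            if nxt not in JAVA8_ESCAPE_CHARS:
                bad.append('\\' + nxt)
            i += 2  # skip the character after the backslash as well
        else:
            i += 1
    return bad


def escape_for_java_literal(text):
    """Escape backslashes and double quotes for use inside a Java string literal."""
    return text.replace('\\', '\\\\').replace('"', '\\"')


# Hypothetical auto-generated name for the unaliased column (an assumption,
# not taken from Spark's output), and the name after .alias('string').
unaliased_name = r'regexp_replace(string, [\s]+,  )'
aliased_name = 'string'

print(invalid_java_escapes(unaliased_name))  # ['\\s']
print(invalid_java_escapes(aliased_name))    # []
print(invalid_java_escapes(escape_for_java_literal(unaliased_name)))  # []
```

Under this assumption, aliasing works because the alias contains no backslash, while escaping the name before embedding it would remove the need for the alias altogether.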