Posted to issues@spark.apache.org by "Matthew Livesey (JIRA)" <ji...@apache.org> on 2016/06/24 10:18:16 UTC
[jira] [Created] (SPARK-16191) Code-Generated SpecificColumnarIterator fails for wide pivot with caching
Matthew Livesey created SPARK-16191:
---------------------------------------
Summary: Code-Generated SpecificColumnarIterator fails for wide pivot with caching
Key: SPARK-16191
URL: https://issues.apache.org/jira/browse/SPARK-16191
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.6.1
Reporter: Matthew Livesey
When caching a pivot of more than 2260 columns, the instance of
SpecificColumnarIterator which is generated by code-generation fails to be compiled with:
bq. failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "()Z" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator" grows beyond 64 KB
This can be reproduced in PySpark with the following (it took some trial and error to find that 2261 is the magic number at which the generated class breaks the 64 KB limit).
{code}
def build_pivot(width):
    categories = ["cat_%s" % i for i in range(0, width)]
    customers = ["cust_%s" % i for i in range(0, 10)]
    rows = []
    for cust in customers:
        for cat in categories:
            for i in range(0, 4):
                row = (cust, cat, i, 7.0)
                rows.append(row)
    rdd = sc.parallelize(rows)
    df = sqlContext.createDataFrame(rdd, ["customer", "category", "instance", "value"])
    pivot_value_rows = df.select("category").distinct().orderBy("category").collect()
    pivot_values = [r.category for r in pivot_value_rows]
    import pyspark.sql.functions as func
    pivot = df.groupBy("customer").pivot("category", pivot_values).agg(func.sum(df.value)).cache()
    pivot.write.save("my_pivot", mode="overwrite")

for i in [2260, 2261]:
    try:
        build_pivot(i)
        print "Succeeded for %s" % i
    except:
        print "Failed for %s" % i
{code}
Removing the `cache()` call avoids the problem and allows wider pivots; since SpecificColumnarIterator is generated specifically for caching, it is not produced when caching is not used.
This could be symptomatic of a general problem: generated code can exceed the JVM's 64 KB bytecode limit per method, so the same failure may occur in other code-generation paths as well.
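The usual remedy for this class of failure is to split an oversized generated method into fixed-size helper methods, so that no single method's body grows with the column count. The sketch below is only an illustration of that idea in plain Python (it is NOT Spark's actual code generator, and the function names `generate_flat` / `generate_chunked` are hypothetical):

```python
# Illustrative sketch only: mimics one flat generated method whose body grows
# with the number of columns, versus splitting the body into bounded chunks
# (analogous to keeping each JVM method under the 64 KB bytecode limit).

def generate_flat(n_cols):
    """Emit one function whose body grows linearly with the column count."""
    lines = ["def read_row(buf):", "    out = []"]
    for i in range(n_cols):
        lines.append("    out.append(buf[%d] * 2)" % i)
    lines.append("    return out")
    return "\n".join(lines)

def generate_chunked(n_cols, chunk=100):
    """Emit helper functions of bounded size plus a small driver, so no
    single function's body grows with the total column count."""
    lines = []
    starts = list(range(0, n_cols, chunk))
    for s in starts:
        lines.append("def read_chunk_%d(buf, out):" % s)
        for i in range(s, min(s + chunk, n_cols)):
            lines.append("    out.append(buf[%d] * 2)" % i)
    lines.append("def read_row(buf):")
    lines.append("    out = []")
    for s in starts:
        lines.append("    read_chunk_%d(buf, out)" % s)
    lines.append("    return out")
    return "\n".join(lines)

# Both variants compute the same result; only the shape of the generated
# code differs, which is what matters for the per-method size limit.
ns = {}
exec(generate_chunked(250, chunk=100), ns)
print(ns["read_row"](list(range(250)))[:3])  # -> [0, 2, 4]
```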
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org