Posted to user@spark.apache.org by Raviteja Lokineni <ra...@gmail.com> on 2016/11/09 15:48:16 UTC
Aggregations on every column on dataframe causing StackOverflowError
Hi all,
I am not sure whether this is a bug. Basically, I am generating weekly
aggregates of every column of the data.
The source code is below (also attached):
from pyspark.sql.window import Window
from pyspark.sql.functions import *

timeSeries = sqlContext.read.option("header", "true") \
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat") \
    .load("file:///tmp/spark-bug.csv")

# Hive timestamps are interpreted as UNIX timestamps in seconds, so
# casting dt to timestamp and then to long gives seconds for rangeBetween.
days = lambda i: i * 86400

# 7-day sliding window per id: the current row plus the previous six days.
w = (Window()
     .partitionBy("id")
     .orderBy(col("dt").cast("timestamp").cast("long"))
     .rangeBetween(-days(6), 0))

cols = ["id", "dt"]
skipCols = ["id", "dt"]

# Loop variable is c (not col) so that pyspark.sql.functions.col
# is not shadowed.
for c in timeSeries.columns:
    if c in skipCols:
        continue
    cols.append(mean(c).over(w).alias("mean_7_" + c))
    cols.append(count(c).over(w).alias("count_7_" + c))
    cols.append(sum(c).over(w).alias("sum_7_" + c))
    cols.append(min(c).over(w).alias("min_7_" + c))
    cols.append(max(c).over(w).alias("max_7_" + c))

df = timeSeries.select(cols)
df.orderBy('id', 'dt').write \
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat") \
    .save("file:///tmp/spark-bug-out.csv")
Thanks,
--
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade
E: raviteja.lokineni@gmail.com
LinkedIn: <http://in.linkedin.com/in/ravitejalokineni>
Re: Aggregations on every column on dataframe causing StackOverflowError
Posted by Michael Armbrust <mi...@databricks.com>.
It would be great if you could try with the 2.0.2 RC. Thanks for creating
the issue.
On Wed, Nov 9, 2016 at 1:22 PM, Raviteja Lokineni <
raviteja.lokineni@gmail.com> wrote:
> Well, I've tried with 1.5.2, 1.6.2, and 2.0.1.
>
> FYI, I have created https://issues.apache.org/jira/browse/SPARK-18388
Re: Aggregations on every column on dataframe causing StackOverflowError
Posted by Raviteja Lokineni <ra...@gmail.com>.
Well, I've tried with 1.5.2, 1.6.2, and 2.0.1.
FYI, I have created https://issues.apache.org/jira/browse/SPARK-18388
On Wed, Nov 9, 2016 at 3:08 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> Which version of Spark? Does seem like a bug.
--
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade
E: raviteja.lokineni@gmail.com
LinkedIn: <http://in.linkedin.com/in/ravitejalokineni>
Re: Aggregations on every column on dataframe causing StackOverflowError
Posted by Michael Armbrust <mi...@databricks.com>.
Which version of Spark? Does seem like a bug.
On Wed, Nov 9, 2016 at 10:06 AM, Raviteja Lokineni <
raviteja.lokineni@gmail.com> wrote:
> Does this stacktrace look like a bug, guys? It definitely seems like one to me.
Re: Aggregations on every column on dataframe causing StackOverflowError
Posted by Raviteja Lokineni <ra...@gmail.com>.
Does this stacktrace look like a bug, guys? It definitely seems like one to me.
Caused by: java.lang.StackOverflowError
at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
at scala.collection.immutable.List.foreach(List.scala:381)
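
If anyone wants to reproduce this without the original CSV, a synthetic
frame along the following lines should exercise the same code path. This
is an untested sketch: the row and column counts are arbitrary, and the
idea that plan depth grows with the number of window expressions is an
assumption based on the recursive prepare() frames above, not something
confirmed in this thread:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Build a wide synthetic frame: a grouping key, an ordering column,
# and many value columns.
numCols = 200  # arbitrary; raise it until the error appears
base = sqlContext.range(0, 1000).withColumn("grp", F.col("id") % 10)
wide = base.select("grp", "id",
                   *[(F.col("id") * n).alias("c%d" % n) for n in range(numCols)])

w2 = Window.partitionBy("grp").orderBy("id").rangeBetween(-6, 0)
exprs = ["grp", "id"]
for n in range(numCols):
    name = "c%d" % n
    exprs.append(F.mean(name).over(w2).alias("mean_" + name))
    exprs.append(F.sum(name).over(w2).alias("sum_" + name))

# If the physical plan nests once per group of window expressions,
# SparkPlan.prepare() recurses once per level and will eventually
# overflow the default JVM thread stack, matching the trace above.
wide.select(exprs).explain(True)

If the depth itself is the culprit, a larger driver stack (for example,
an -Xss setting passed through spark.driver.extraJavaOptions at submit
time) might at least postpone the error, though that is a guess, not a fix.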
--
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade
E: raviteja.lokineni@gmail.com
LinkedIn: <http://in.linkedin.com/in/ravitejalokineni>