Posted to user@spark.apache.org by Raviteja Lokineni <ra...@gmail.com> on 2016/11/09 15:48:16 UTC

Aggregations on every column on dataframe causing StackOverflowError

Hi all,

I am not sure whether this is a bug. Basically, I am generating weekly
(7-day rolling) aggregates of every column of the data.

Adding source code here (also attached):

from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, max, mean, min, sum

timeSeries = sqlContext.read.option("header", "true") \
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat") \
    .load("file:///tmp/spark-bug.csv")

# Hive timestamp is interpreted as UNIX timestamp in seconds
days = lambda i: i * 86400

# 7-day rolling window per id, ordered by dt cast to epoch seconds
w = (Window()
     .partitionBy("id")
     .orderBy(col("dt").cast("timestamp").cast("long"))
     .rangeBetween(-days(6), 0))

cols = ["id", "dt"]
skipCols = ["id", "dt"]

# Use a loop variable other than "col" so pyspark.sql.functions.col
# is not shadowed by the column-name string
for c in timeSeries.columns:
    if c in skipCols:
        continue
    cols.append(mean(c).over(w).alias("mean_7_" + c))
    cols.append(count(c).over(w).alias("count_7_" + c))
    cols.append(sum(c).over(w).alias("sum_7_" + c))
    cols.append(min(c).over(w).alias("min_7_" + c))
    cols.append(max(c).over(w).alias("max_7_" + c))

df = timeSeries.select(cols)
df.orderBy('id', 'dt').write \
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat") \
    .save("file:///tmp/spark-bug-out.csv")

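For what it's worth, the only workaround I can think of is to compute the window
aggregates in smaller batches of columns and join the pieces back together on
("id", "dt"), so that no single plan carries hundreds of window expressions. This
is just an untested sketch: the helper names, the batch size of 10, and the
assumption that ("id", "dt") uniquely identifies a row are mine.

# Untested workaround sketch: window-aggregate the columns in batches and
# join the partial results back on ("id", "dt").
valueCols = [c for c in timeSeries.columns if c not in skipCols]
result = timeSeries.select("id", "dt")
for start in range(0, len(valueCols), 10):
    exprs = ["id", "dt"]
    for c in valueCols[start:start + 10]:
        exprs.append(mean(c).over(w).alias("mean_7_" + c))
        exprs.append(count(c).over(w).alias("count_7_" + c))
        exprs.append(sum(c).over(w).alias("sum_7_" + c))
        exprs.append(min(c).over(w).alias("min_7_" + c))
        exprs.append(max(c).over(w).alias("max_7_" + c))
    result = result.join(timeSeries.select(exprs), ["id", "dt"])

Raising the JVM thread stack size on the driver (the -Xss option) might also push
the limit out, but it would not address whatever is making the plan so deep.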

Thanks,
-- 
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade

E: raviteja.lokineni@gmail.com

LinkedIn: http://in.linkedin.com/in/ravitejalokineni

Re: Aggregations on every column on dataframe causing StackOverflowError

Posted by Michael Armbrust <mi...@databricks.com>.
It would be great if you could try with the 2.0.2 RC.  Thanks for creating
an issue.


Re: Aggregations on every column on dataframe causing StackOverflowError

Posted by Raviteja Lokineni <ra...@gmail.com>.
Well, I've tried with 1.5.2, 1.6.2, and 2.0.1.

FYI, I have created https://issues.apache.org/jira/browse/SPARK-18388

-- 
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade

E: raviteja.lokineni@gmail.com

LinkedIn: http://in.linkedin.com/in/ravitejalokineni

Re: Aggregations on every column on dataframe causing StackOverflowError

Posted by Michael Armbrust <mi...@databricks.com>.
Which version of Spark?  Does seem like a bug.


Re: Aggregations on every column on dataframe causing StackOverflowError

Posted by Raviteja Lokineni <ra...@gmail.com>.
Does this stack trace look like a bug to you? It definitely seems like one to me.

Caused by: java.lang.StackOverflowError
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
	at scala.collection.immutable.List.foreach(List.scala:381)

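In case it helps with triage, the plan can be inspected without running the job
via the standard DataFrame.explain() call (timeSeries and cols are the same
variables as in the script from my original post); if the physical plan grows
with the number of window expressions, that would line up with the recursive
SparkPlan.prepare() calls above.

# Print the parsed, analyzed, optimized and physical plans without executing
# the query, to see how deeply nested the plan gets.
df = timeSeries.select(cols)
df.explain(True)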


-- 
*Raviteja Lokineni* | Business Intelligence Developer
TD Ameritrade

E: raviteja.lokineni@gmail.com

LinkedIn: http://in.linkedin.com/in/ravitejalokineni