Posted to user@spark.apache.org by Martin Engen <Ma...@outlook.com> on 2018/05/15 13:23:29 UTC

Structured Streaming, Reading and Updating a variable

Hello,

I'm working with Structured Streaming, and I need a method of keeping a running average based on the last 24 hours of data.
To help with this, I can use exponential smoothing, which means I really only need to carry one value from the previous calculation into the next, and update this variable as the calculations carry on.
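
The smoothing update itself is a one-liner; here is a minimal Scala sketch (the smoothing factor alpha and the seed value are my own assumptions, nothing agreed in this thread):

def smooth(prev: Double, x: Double, alpha: Double = 0.1): Double =
  // blend the newest observation with the previous smoothed value,
  // so only a single Double of state needs to be kept around
  alpha * x + (1 - alpha) * prev

// e.g. folding a series of readings into one running value:
// Seq(10.0, 12.0, 11.5).foldLeft(10.0)(smooth(_, _))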

Implementing this is a much bigger challenge than I ever imagined.


I've tried using Accumulators, and querying/storing data in Cassandra after every calculation. Both methods worked somewhat locally, but I don't seem to be able to use them on the Spark worker nodes, as I get the error
"java.lang.NoClassDefFoundError: Could not initialize class" for both the accumulator and the Cassandra connector library.

How can you read/update a variable while doing calculations using Structured Streaming?

Thank you



Re: Structured Streaming, Reading and Updating a variable

Posted by Martin Engen <Ma...@outlook.com>.
I have been experimenting with aggregations, but I seem to hit a wall on two issues.
example:
val avg = areaStateDf.groupBy($"plantKey").avg("sensor")

1) How can I use the result from an aggregation within the same stream, to do further calculations?
2) It seems to be very slow when I want a moving window of 24 hours with a moving average over some of the calculations within it, even when testing locally.

The Accumulator issue:
Simple Counter Accumulator:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object Test {
  // built in the object initializer, i.e. whenever the Test$ class is first loaded
  private val spark = SparkHelper.getSparkSession()
  import spark.implicits._
  import com.datastax.spark.connector._

  val counter = spark.sparkContext.longAccumulator("counter")

  // UDF body: bump the accumulator and return its current value
  val fetchData = () => {
    counter.add(2)
    counter.value
  }

  val fetchdataUDF = spark.sqlContext.udf.register("testUDF", fetchData)

  def calculate(areaStateDf: DataFrame): StreamingQuery = {
    import spark.implicits._
    val ds = areaStateDf.select($"areaKey").withColumn("fetchedData", fetchdataUDF())
    KafkaSinks.debugStream(ds, "volumTest")
  }
}
I would create a custom accumulator to implement a smoothing algorithm, but I can't even seem to get a normal counter working.
This works locally, but on the server running Docker (with a master and one worker) it throws this error:

18/05/16 08:35:22 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 172.17.0.5, executor 0): java.lang.ExceptionInInitializerError
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:24)
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:15)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
at com.cambi.assurance.spark.SparkHelper$.getSparkSession(SparkHelper.scala:28)
at com.client.spark.calculations.Test$.<init>(ThpLoad1.scala:10)
at com.client.spark.calculations.Test$.<clinit>(ThpLoad1.scala)
... 18 more

18/05/16 08:35:22 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 3.0 (TID 4, 172.17.0.5, executor 0): java.lang.NoClassDefFoundError: Could not initialize class com.client.spark.calculations.Test$
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:24)
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:15)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
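
Reading the trace, my best guess is this: Test$.<init> builds the SparkSession inside the object initializer, the UDF closure references counter and therefore drags the whole Test$ object onto the executors, and when an executor class-loads Test$ it tries to create a SparkSession with no master URL set. After that first failure (the ExceptionInInitializerError), every retry reports NoClassDefFoundError for the class. If that reading is right, something like the sketch below might keep driver-only state out of the closure (untested; SparkHelper and KafkaSinks are the helpers from my snippet above):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.streaming.StreamingQuery

object Test {
  // lazy: the session is built on first use on the driver,
  // not while an executor happens to be class-loading Test$
  private lazy val spark: SparkSession = SparkHelper.getSparkSession()

  def calculate(areaStateDf: DataFrame): StreamingQuery = {
    import spark.implicits._
    // capture a plain serializable function, not the enclosing object;
    // the accumulator is left out, since a task only ever sees its own
    // local copy of an accumulator's value, never the global total
    val fetchData: () => Long = () => 2L
    val fetchDataUDF = udf(fetchData)
    val ds = areaStateDf.select($"areaKey").withColumn("fetchedData", fetchDataUDF())
    KafkaSinks.debugStream(ds, "volumTest")
  }
}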


Any ideas about how to handle this error?


        Thanks,
        Martin Engen
________________________________
From: Lalwani, Jayesh <Ja...@capitalone.com>
Sent: Tuesday, May 15, 2018 9:59 PM
To: Martin Engen; user@spark.apache.org
Subject: Re: Structured Streaming, Reading and Updating a variable


Do you have a code sample, and detailed error message/exception to show?



From: Martin Engen <Ma...@outlook.com>
Date: Tuesday, May 15, 2018 at 9:24 AM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Structured Streaming, Reading and Updating a variable



Hello,



I'm working with Structured Streaming, and I need a method of keeping a running average based on the last 24 hours of data.

To help with this, I can use exponential smoothing, which means I really only need to carry one value from the previous calculation into the next, and update this variable as the calculations carry on.



Implementing this is a much bigger challenge than I ever imagined.





I've tried using Accumulators, and querying/storing data in Cassandra after every calculation. Both methods worked somewhat locally, but I don't seem to be able to use them on the Spark worker nodes, as I get the error

"java.lang.NoClassDefFoundError: Could not initialize class" for both the accumulator and the Cassandra connector library.



How can you read/update a variable while doing calculations using Structured Streaming?



Thank you






Re: Structured Streaming, Reading and Updating a variable

Posted by "Lalwani, Jayesh" <Ja...@capitalone.com>.
Do you have a code sample, and detailed error message/exception to show?

From: Martin Engen <Ma...@outlook.com>
Date: Tuesday, May 15, 2018 at 9:24 AM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Structured Streaming, Reading and Updating a variable

Hello,

I'm working with Structured Streaming, and I need a method of keeping a running average based on the last 24 hours of data.
To help with this, I can use exponential smoothing, which means I really only need to carry one value from the previous calculation into the next, and update this variable as the calculations carry on.

Implementing this is a much bigger challenge than I ever imagined.


I've tried using Accumulators, and querying/storing data in Cassandra after every calculation. Both methods worked somewhat locally, but I don't seem to be able to use them on the Spark worker nodes, as I get the error
"java.lang.NoClassDefFoundError: Could not initialize class" for both the accumulator and the Cassandra connector library.

How can you read/update a variable while doing calculations using Structured Streaming?

Thank you



Re: Structured Streaming, Reading and Updating a variable

Posted by Koert Kuipers <ko...@tresata.com>.
You use a windowed aggregation for this
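
For the 24-hour running average, that would look something along these lines; a minimal sketch, where the watermark, the slide interval, and the timestamp column name are assumptions on my part (plantKey and sensor are from your earlier snippet):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, window}

def windowedAvg(areaStateDf: DataFrame): DataFrame =
  areaStateDf
    // the watermark bounds how much state Spark keeps for late data
    .withWatermark("timestamp", "24 hours")
    // one row per plant per sliding 24-hour window, advancing every hour
    .groupBy(window(col("timestamp"), "24 hours", "1 hour"), col("plantKey"))
    .agg(avg(col("sensor")).as("avgSensor"))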

On Tue, May 15, 2018, 09:23 Martin Engen <Ma...@outlook.com> wrote:

> Hello,
>
>
>
> I'm working with Structured Streaming, and I need a method of keeping a
> running average based on the last 24 hours of data.
>
> To help with this, I can use exponential smoothing, which means I really
> only need to carry one value from the previous calculation into the next,
> and update this variable as the calculations carry on.
>
>
>
> Implementing this is a much bigger challenge than I ever imagined.
>
>
>
>
>
> I've tried using Accumulators, and querying/storing data in Cassandra after
> every calculation. Both methods worked somewhat locally, but I don't seem
> to be able to use them on the Spark worker nodes, as I get the error
> "java.lang.NoClassDefFoundError: Could not initialize class" for both
> the accumulator and the Cassandra connector library.
>
>
>
> How can you read/update a variable while doing calculations using
> Structured Streaming?
>
>
> Thank you
>
>
>
>