Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/18 12:35:55 UTC
[GitHub] [spark] gaborgsomogyi commented on issue #24403: [SPARK-23014][SS] Fully remove V1 memory sink.
URL: https://github.com/apache/spark/pull/24403#issuecomment-484487619
Just for the sake of understanding why the Python adaptation is needed.
With the V1 memory sink, the following exception was raised:
```
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 428, in main
process()
File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 423, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 457, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/Users/gaborsomogyi/spark/python/pyspark/serializers.py", line 446, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/Users/gaborsomogyi/spark/python/lib/pyspark.zip/pyspark/worker.py", line 86, in <lambda>
return lambda *a: f(*a)
File "/Users/gaborsomogyi/spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "/Users/gaborsomogyi/spark/python/pyspark/sql/tests/test_streaming.py", line 217, in <lambda>
bad_udf = udf(lambda x: 1 / 0)
ZeroDivisionError: integer division or modulo by zero
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:25)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:748)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:852)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:852)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:291)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1349)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
=== Streaming Query ===
Identifier: this_query [id = fcda74c8-05e0-4c74-ac45-3444f3e50c63, runId = 7534a374-5d15-4986-bae1-8fa0664c647b]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming]: {"logOffset":0}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
Project [<lambda>(value#4) AS <lambda>(value)#17]
+- StreamingExecutionRelation FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming], [value#4]
```
With the V2 memory sink, the following exception was raised (here the additional exceptions, for instance `ZeroDivisionError`, are linked to the top-level one instead of appearing in its message):
```
Writing job aborted.
=== Streaming Query ===
Identifier: this_query [id = d37ca2ea-edb7-4670-b4cc-75b0d704496e, runId = 587d62da-8355-4f66-a9d8-708c154ef953]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming]: {"logOffset":0}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
WriteToMicroBatchDataSource org.apache.spark.sql.execution.streaming.sources.MemoryStreamingWrite@1a3691ef
+- Project [<lambda>(value#18) AS <lambda>(value)#32]
+- StreamingExecutionRelation FileStreamSource[file:/Users/gaborsomogyi/spark/python/test_support/sql/streaming], [value#18]
```
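To make the difference concrete, here is a minimal sketch in plain Python (no Spark involved) of a check that walks the linked exceptions instead of inspecting only the top-level message. The names `exception_chain_contains` and `run_failing_write` are hypothetical, and Python's built-in `__cause__` chaining stands in, as an assumption, for however the streaming exception exposes its linked causes.

```python
def exception_chain_contains(exc, text):
    """Return True if `text` appears in `exc` or in any exception linked to it.

    Each exception is rendered as "<ClassName>: <message>" so that a search
    for a class name such as "ZeroDivisionError" can match.
    """
    seen = set()  # guards against cycles in the chain
    current = exc
    while current is not None and id(current) not in seen:
        if text in "{}: {}".format(type(current).__name__, current):
            return True
        seen.add(id(current))
        # Follow the explicit cause first, then the implicit context.
        current = current.__cause__ or current.__context__
    return False


def run_failing_write():
    """Hypothetical stand-in for a write whose real failure is only linked."""
    try:
        1 / 0
    except ZeroDivisionError as err:
        # The top-level message no longer mentions the root cause.
        raise RuntimeError("Writing job aborted.") from err


try:
    run_failing_write()
except RuntimeError as top:
    top_level_error = top

# A message-only check misses the root cause, while the chain walk finds it.
found_in_message = "ZeroDivisionError" in str(top_level_error)
found_in_chain = exception_chain_contains(top_level_error, "ZeroDivisionError")
```

A test that previously asserted on the top-level message alone would need this kind of traversal once the root cause is only reachable through the linked exceptions.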