Posted to issues@spark.apache.org by "Brian ONeill (JIRA)" <ji...@apache.org> on 2016/04/06 16:35:25 UTC
[jira] [Commented] (SPARK-14421) Kinesis deaggregation with PySpark

    [ https://issues.apache.org/jira/browse/SPARK-14421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228351#comment-15228351 ] 

Brian ONeill commented on SPARK-14421:
--------------------------------------

Interesting observation here:
I upgraded the Kinesis consumer library (which required a protobuf bump as well).

After that, the de-aggregation appeared to work. However, we then received this stack trace:
{code}
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-48: ordinal not in range(128)

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more


	at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95)
	at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78)
	at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:189)
	at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:189)
{code}
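
The trace above shows only the JVM side, so the Python frame that raised the error is not visible. As a minimal sketch, assuming the failure comes from a record containing non-ASCII characters being converted with the default ASCII codec (as an implicit str() conversion does on Python 2), the following reproduces the same class of error and shows an explicit UTF-8 encode that avoids it. The record value and the helper names are hypothetical and are not taken from the Kinesis word count example.

```python
def to_ascii_strict(text):
    """Encode text with the ASCII codec; raises UnicodeEncodeError
    if the text contains any character outside the range 0-127."""
    return text.encode("ascii")


def to_utf8(text):
    """Encode text explicitly as UTF-8, which can represent any code point."""
    return text.encode("utf-8")


# Hypothetical record payload; the real payload is not shown in the trace.
record = u"caf\u00e9 \u2603"

try:
    to_ascii_strict(record)
    ascii_failed = False
except UnicodeEncodeError as exc:
    # exc.reason carries the "ordinal not in range(128)" text seen in the trace.
    ascii_failed = True
    failure_reason = exc.reason

# The explicit UTF-8 encode succeeds and round-trips without loss.
utf8_bytes = to_utf8(record)
round_tripped = utf8_bytes.decode("utf-8")
```

If this assumption holds, encoding or decoding record payloads explicitly as UTF-8 in the Python transform functions would be one place to start looking.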

> Kinesis deaggregation with PySpark
> ----------------------------------
>
>                 Key: SPARK-14421
>                 URL: https://issues.apache.org/jira/browse/SPARK-14421
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>         Environment: PySpark w/ Kinesis word count example
>            Reporter: Brian ONeill
>
> I'm creating this issue as a precaution...
> We have some preliminary evidence that KPL de-aggregation for Kinesis streams may not work in Spark 1.6.1. Using the PySpark Kinesis Word Count example with masterUrl = local[16], we don't receive records when the data is produced by the KPL with aggregation turned on.
> At the same time, I noticed this thread:
> https://forums.aws.amazon.com/message.jspa?messageID=707122
> Following the instructions here:
> http://brianoneill.blogspot.com/2016/03/pyspark-on-amazon-emr-w-kinesis.html
> the example will sometimes work. When aggregation is disabled, it appears to always work. I'm going to dig a bit deeper, but thought you might have some pointers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
