You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Brian ONeill (JIRA)" <ji...@apache.org> on 2016/04/06 18:18:25 UTC

[jira] [Comment Edited] (SPARK-14421) Kinesis deaggregation with PySpark

    [ https://issues.apache.org/jira/browse/SPARK-14421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228351#comment-15228351 ] 

Brian ONeill edited comment on SPARK-14421 at 4/6/16 4:18 PM:
--------------------------------------------------------------

Interesting observation here.  I uprev'd the kinesis consumer library (which required a protobuf bump as well)  After that, the de-aggregation appeared to work. 

Spark 1.6.1 was using 1.4.0 of KCL.  That should have supported the de-aggregation, but under some circumstances it appeared not to work.

I went to:
{code}
 <aws.kinesis.client.version>1.6.2</aws.kinesis.client.version>
{code}

That required going to:
{code}
 <protobuf.version>2.6.1</protobuf.version>
{code}

After making those two changes, I was able to consume data from a stream, whose data was aggregated by KPL version: 
{code}
    compile (group:'com.amazonaws', name:'amazon-kinesis-producer', version:'0.10.2')
{code}


was (Author: boneill42):
Interesting observation here: 
I uprev'd the kinesis consumer library (which required a protobuf bump as well)

After that, the de-aggregation appeared to work.  However, we received this stack trace:
{code}
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47-48: ordinal not in range(128)

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more


	at org.apache.spark.streaming.api.python.TransformFunction.callPythonTransformFunction(PythonDStream.scala:95)
	at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:78)
	at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:189)
	at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:189)
{code}

> Kinesis deaggregation with PySpark
> ----------------------------------
>
>                 Key: SPARK-14421
>                 URL: https://issues.apache.org/jira/browse/SPARK-14421
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>         Environment: PySpark w/ Kinesis word count example
>            Reporter: Brian ONeill
>
> I'm creating this issue as a precaution...
> We have some preliminary evidence that indicates that KPL de-aggregation for Kinesis streams may not work in Spark 1.6.1.  Using the PySpark Kinesis Word Count example, we don't receive records when KPL is used to produce the data, with aggregation turned on, using masterUrl = local[16].
> At the same time, I noticed this thread:
> https://forums.aws.amazon.com/message.jspa?messageID=707122
> Following the instructions here:
> http://brianoneill.blogspot.com/2016/03/pyspark-on-amazon-emr-w-kinesis.html
> The example will sometimes work.   When aggregation is disabled, it appears to always work.  I'm going to dig a bit deeper, but thought you might have some pointers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org