Posted to dev@spark.apache.org by "Nick R. Katsipoulakis" <ka...@cs.pitt.edu> on 2014/07/17 23:27:20 UTC

InputSplit and RecordReader control on HadoopRDD

Hello,

I am currently implementing custom InputSplit and RecordReader
classes to pass to SparkContext's hadoopRDD() function.
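
For reference, here is a minimal sketch of the hadoopRDD() call shape I have
in mind (the input path is a placeholder and TextInputFormat merely stands in
for my custom InputFormat, which implements the old org.apache.hadoop.mapred
interfaces that hadoopRDD() expects):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
    import org.apache.spark.{SparkConf, SparkContext}

    object HadoopRDDSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hadoopRDD-sketch"))

        val jobConf = new JobConf(sc.hadoopConfiguration)
        FileInputFormat.setInputPaths(jobConf, "hdfs:///tmp/input") // placeholder path

        // One RDD partition is created for each InputSplit returned by
        // InputFormat.getSplits(); my custom InputFormat class would replace
        // TextInputFormat here.
        val rdd = sc.hadoopRDD(
          jobConf,
          classOf[TextInputFormat],
          classOf[LongWritable],
          classOf[Text])

        println(rdd.count())
        sc.stop()
      }
    }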

My question is the following:

Does the value returned by InputSplit.getLength() and/or
RecordReader.getProgress() affect the execution of a map() function in the
Spark runtime?
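
To make the question concrete, here is a minimal sketch (class names are
hypothetical) of where these two methods live in the old mapred API, which is
roughly the shape of my custom classes:

    import java.io.{DataInput, DataOutput}
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{InputSplit, RecordReader}

    // The old-API InputSplit also implements Writable, so it is serialized to
    // the task that reads it and needs a no-arg constructor.
    class MySplit(var totalBytes: Long) extends InputSplit {
      def this() = this(0L)
      override def getLength(): Long = totalBytes          // size reported for this split
      override def getLocations(): Array[String] = Array.empty
      override def write(out: DataOutput): Unit = out.writeLong(totalBytes)
      override def readFields(in: DataInput): Unit = { totalBytes = in.readLong() }
    }

    class MyRecordReader(split: MySplit) extends RecordReader[LongWritable, Text] {
      private var bytesRead = 0L

      override def next(key: LongWritable, value: Text): Boolean = {
        // ... fill key/value from the underlying source, advance bytesRead,
        // and return false once the split is exhausted
        false
      }
      override def createKey(): LongWritable = new LongWritable()
      override def createValue(): Text = new Text()
      override def getPos(): Long = bytesRead
      override def close(): Unit = {}
      // fraction of the split consumed so far, in [0.0f, 1.0f]
      override def getProgress(): Float =
        if (split.getLength() == 0L) 1.0f
        else bytesRead.toFloat / split.getLength()
    }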

I am asking because I have used these two custom classes in plain Hadoop
without any problems. However, in Spark I see that new InputSplit objects are
created at runtime. To be more precise:

In the beginning, I see in my log file that an InputSplit object is created
and that the RecordReader associated with it is fetching records. At some
point, the job handling that InputSplit stops, and a new one is spawned with a
new InputSplit. I do not understand why this happens.

Any help would be appreciated.

Thank you,
Nick

P.S.-1: I am sorry for posting my question on the Developer Mailing List,
but I could not find anything similar in the user list. Also, I really need to
understand Spark's runtime behavior, and I believe my question is more likely
to be read by Spark contributors on the developer list.

P.S.-2: I can provide more technical details if they are needed.