Posted to user@spark.apache.org by Kostas Kougios <ko...@googlemail.com> on 2015/07/02 18:05:31 UTC

wholeTextFiles("/x/*/*.txt") runs single threaded

Hi, I have a cluster of 4 machines and I run:

sc.wholeTextFiles("/x/*/*.txt")

Folder x contains subfolders, and each subfolder contains thousands of files,
with a total of ~1 million files matching the path expression.

My Spark job starts processing the files, but single-threaded. In the Spark UI
I can see that only 1 of the 4 executors is used, and only 1 thread out of
the configured 24:

spark-submit --class com.stratified.articleids.NxmlExtractorJob \
	--driver-memory 8g \
	--executor-memory 8g \
	--num-executors 4 \
	--executor-cores 16 \
	--master yarn-cluster \
	--conf spark.akka.frameSize=128 \
	$JAR


My actual code is:

      val rdd=extractIds(sc.wholeTextFiles(xmlDir))
      rdd.saveAsObjectFile(serDir)

Is saveAsObjectFile causing this, and are there any workarounds?





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: wholeTextFiles("/x/*/*.txt") runs single threaded

Posted by Kostas Kougios <ko...@googlemail.com>.
In the Spark UI I can see it creating 2 stages. I tried
wholeTextFiles().repartition(32), but the threading behaviour is the same.
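
For anyone reproducing this, a minimal sketch of how the partition count could be inspected before and after repartitioning. It is not a confirmed fix for the behaviour above. The object name, the app name, and the value 64 (4 executors x 16 cores, taken from the spark-submit line) are assumptions for illustration. Note that minPartitions is only a suggested minimum, and that computing the partitions requires listing every matching file, which may itself take a while with ~1 million files.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical diagnostic job; the object and app names are illustrative only.
object PartitionCheck {
  def main(args: Array[String]): Unit = {
    // The master is expected to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("partition-check"))

    // Assumption: 64 = 4 executors * 16 cores. minPartitions is a hint
    // to the input format, not a guarantee of the resulting partition count.
    val files = sc.wholeTextFiles("/x/*/*.txt", minPartitions = 64)
    println(s"partitions after read: ${files.partitions.length}")

    // repartition adds a shuffle; the read stage before it still runs
    // with the partition count printed above.
    val spread = files.repartition(64)
    println(s"partitions after repartition: ${spread.partitions.length}")

    sc.stop()
  }
}
```

If the first number printed is very small, the read stage itself has little parallelism to offer, and a repartition placed after it would only affect the stages that follow.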



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-x-txt-runs-single-threaded-tp23591p23593.html