Posted to user@spark.apache.org by Roch Denis <rd...@exostatic.com> on 2014/07/16 04:37:38 UTC
No parallelism in map transformation
Hello,
I'm new to Spark, so I assume I'm missing something really obvious, but all my
map operations run on only one processor even though the RDD has many
partitions. I've tried googling the issue but everything looks fine: I use
local[8] and my file has more than one partition (checked with
_jrdd.splits().size(), and I repartitioned to make sure).
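As a minimal sketch of that check (the file path and partition count are
placeholders, and the master string here is incidental; it uses the same
internal _jrdd accessor mentioned above):

from pyspark import SparkContext

sc = SparkContext("local[8]", "partition check")
logData = sc.textFile("/home/rdenis/full_sessions.txt")
print "partitions:", logData._jrdd.splits().size()
# force a higher partition count, then re-check
print "after repartition:", logData.repartition(16)._jrdd.splits().size()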
I run my test program using the following command:

./bin/spark-submit --master local[8] session_tracking_spark.py
The code itself:
"""SimpleApp.py"""
from pyspark import SparkContext
from dateutil.parser import parse
from datetime import datetime
from datetime import timedelta
import json
if __name__ == "__main__":
class ParsedLogLine:
def __init__(self):
self.logLineColumns = None
self.logTime = None
self.msgType = None
self.msgContent = None
def parse_line(line):
line = line.rstrip()
results = ParsedLogLine()
results.logLineColumns = line.split("|")
if len(results.logLineColumns) == 6:
results.logTime = parse( results.logLineColumns[1] ) - timedelta(hours=3)
results.msgContent = json.loads( results.logLineColumns[5] )
results.hoplonDbLogLine = results.logLineColumns[0]
return ( results.msgContent["GameSessionId"], results )
# logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your
system
logFile = "/home/rdenis/full_sessions.txt" # Should be some file on your
system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile)
countNb = logData.map(parse_line).count()
print "count:", countNb, "partition nb:", logData._jrdd.splits().size()
Thanks for the help in advance!
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-parallelism-in-map-transformation-tp9863.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: No parallelism in map transformation
Posted by Roch Denis <rd...@exostatic.com>.
Well, for what it's worth, I found the answer in the Mesos Spark
documentation:
https://github.com/mesos/spark/wiki/Spark-Programming-Guide

The quick start guide says to use "--master local[4]" with spark-submit,
which implies that this alone makes Spark use more than one processor.
However, that doesn't work: the example context creation needs to be changed
from

sc = SparkContext("local", "Simple App")

to

sc = SparkContext("local[*]", "Simple App")
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-parallelism-in-map-transformation-tp9863p9870.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.