Posted to user@spark.apache.org by YaoPau <jo...@gmail.com> on 2014/11/13 16:34:37 UTC

Building a hash table from a csv file using yarn-cluster, and giving it to each executor

I built my Spark Streaming app on my local machine, and an initial step in
log processing is filtering out rows with spam IPs.  I use the following
code which works locally:

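    // Assumes the immutable HashSet (the import was not shown in the original)
    import scala.collection.immutable.HashSet

    // Creates a HashSet of bad IPs read in from a file
    val badIpSource = scala.io.Source.fromFile("wrongIPlist.csv")
    val ipLines = badIpSource.getLines()

    // getLines() is lazy, so build the set before closing the source
    val badIpSet = HashSet[String]() ++ ipLines
    badIpSource.close()

    def isGoodIp(ip: String): Boolean = !badIpSet.contains(ip)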

But when I try this using "--master yarn-cluster" I get:

    Exception in thread "Thread-4" java.lang.reflect.InvocationTargetException ... Caused by:
    java.io.FileNotFoundException: wrongIPlist.csv (No such file or directory)

The file is there (I wasn't sure which directory it was being read from, so
I put it in both my current client directory and my HDFS home directory), so
now I'm wondering if reading a file in parallel is just not allowed in
general, and whether that's why I'm getting the error.

I'd like each executor to have access to this HashSet (not a huge file,
about 3000 IPs) instead of having to do a more expensive JOIN.  Any
recommendations on a better way to handle this?  



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Building-a-hash-table-from-a-csv-file-using-yarn-cluster-and-giving-it-to-each-executor-tp18850.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Building a hash table from a csv file using yarn-cluster, and giving it to each executor

Posted by aappddeevv <aa...@gmail.com>.

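If the file is not present on each node, the process that tries to read it
may not find it. In yarn-cluster mode the driver itself runs on a cluster
node, so a relative local path such as wrongIPlist.csv is resolved there
rather than on the client machine.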
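
One possible way around this, sketched below under some assumptions, is to
read the copy that is already in HDFS through Spark and hand the resulting
set to the executors as a broadcast variable. The object name BadIpFilter,
the helper names, the hdfsPath argument and the ipOf function are all
hypothetical and only there for illustration. This is not necessarily how
the original app should be structured.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BadIpFilter {
  // Pure helper: turn raw lines into a set of trimmed, non-empty IP strings.
  def parseBadIps(lines: Iterator[String]): Set[String] =
    lines.map(_.trim).filter(_.nonEmpty).toSet

  // Read the list through Spark so the path is resolved against HDFS
  // instead of the local disk of whichever node runs the driver, then
  // broadcast one read-only copy of the set to the executors.
  // hdfsPath is an assumed location, e.g. "hdfs:///path/to/wrongIPlist.csv".
  def loadBadIps(sc: SparkContext, hdfsPath: String): Broadcast[Set[String]] = {
    val lines = sc.textFile(hdfsPath).collect() // about 3000 lines, small enough to collect
    sc.broadcast(parseBadIps(lines.iterator))
  }

  // Keep only the rows whose IP is not in the broadcast set.
  // ipOf is a hypothetical function that extracts the IP from a log row.
  def dropBadIps(rows: RDD[String],
                 badIps: Broadcast[Set[String]],
                 ipOf: String => String): RDD[String] =
    rows.filter(row => !badIps.value.contains(ipOf(row)))
}
```

In a streaming job this could be applied per batch with something like
stream.transform(rdd => BadIpFilter.dropBadIps(rdd, badIps, ipOf)). Another
option that may be worth trying is passing the file with
"--files wrongIPlist.csv" to spark-submit, which asks YARN to ship it to the
containers' working directories. Whether the original Source.fromFile call
then works unchanged would need to be checked on the cluster.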



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Building-a-hash-table-from-a-csv-file-using-yarn-cluster-and-giving-it-to-each-executor-tp18850p18877.html