Posted to user@spark.apache.org by Alonso Isidoro Roman <al...@gmail.com> on 2016/06/01 07:53:19 UTC

Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

Thank you David, I will try to follow your advice.

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2016-05-31 21:28 GMT+02:00 David Newberger <da...@wandcorp.com>:

> Have you tried it without either of the setMaster lines?
>
>
> Also, CDH 5.7 uses Spark 1.6.0 with some patches. I would recommend using
> the Cloudera repo for the Spark artifacts in build.sbt. I'd also check the
> other dependencies in build.sbt to see if there are CDH-specific versions.
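A sketch of what that might look like in build.sbt (the Cloudera repo URL is the standard one, but the exact "1.6.0-cdh5.7.0" version string is an assumption to verify against the Cloudera documentation):

```scala
// Hypothetical build.sbt fragment for CDH 5.7 (version strings are assumptions, verify them)
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.6.0-cdh5.7.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.0-cdh5.7.0" % "provided"
)
```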
>
>
>
> *David Newberger*
>
>
>
> *From:* Alonso Isidoro Roman [mailto:alonsoir@gmail.com]
> *Sent:* Tuesday, May 31, 2016 1:23 PM
> *To:* David Newberger
> *Cc:* user@spark.apache.org
> *Subject:* Re: About a problem when mapping a file located within a HDFS
> vmware cdh-5.7 image
>
>
>
> Hi David, the one on the develop branch. I think it should be the same,
> but I'm actually not sure...
>
>
>
> Regards
>
>
> *Alonso Isidoro Roman*
>
> about.me/alonso.isidoro.roman
>
>
>
> 2016-05-31 19:40 GMT+02:00 David Newberger <da...@wandcorp.com>:
>
> Is
> https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt
>   the build.sbt you are using?
>
>
>
> *David Newberger*
>
> QA Analyst
>
> *WAND*  -  *The Future of Restaurant Technology*
>
> (W)  www.wandcorp.com
>
> (E)   david.newberger@wandcorp.com
>
> (P)   952.361.6200
>
>
>
> *From:* Alonso [mailto:alonsoir@gmail.com]
> *Sent:* Tuesday, May 31, 2016 11:11 AM
> *To:* user@spark.apache.org
> *Subject:* About a problem when mapping a file located within a HDFS
> vmware cdh-5.7 image
>
>
>
> I have a VMware Cloudera image (CDH 5.7 running on CentOS 6.8). I use
> OS X as my development machine and the CDH image to run the code, which I
> upload to the image with git. I have modified the /etc/hosts file in the
> CDH image so that it contains these lines:
>
> 127.0.0.1       quickstart.cloudera     quickstart      localhost       localhost.domain
>
>
>
> 192.168.30.138       quickstart.cloudera     quickstart      localhost       localhost.domain
>
> The Cloudera version I am running is:
>
> [cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties
>
>
>
> # Autogenerated build properties
>
> version=2.6.0-cdh5.7.0
>
> git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
>
> cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
>
> cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
>
> cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
>
> cloudera.base-branch=cdh5-base-2.6.0
>
> cloudera.build-branch=cdh5-2.6.0_5.7.0
>
> cloudera.pkg.version=2.6.0+cdh5.7.0+1280
>
> cloudera.pkg.release=1.cdh5.7.0.p0.92
>
> cloudera.cdh.release=cdh5.7.0
>
> cloudera.build.time=2016.03.23-18:30:29GMT
>
> I can run an ls command in the VMware machine:
>
> [cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
>
> -rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv
>
> I can read its content:
>
> [cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
>
> 568454
>
> The code is quite simple, just trying to map its content:
>
> val ratingFile="hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"
>
>
>
> case class AmazonRating(userId: String, productId: String, rating: Double)
>
>
>
> val NumRecommendations = 10
>
> val MinRecommendationsPerUser = 10
>
> val MaxRecommendationsPerUser = 20
>
> val MyUsername = "myself"
>
> val NumPartitions = 20
>
>
>
>
>
> println("Using this ratingFile: " + ratingFile)
>
>   // first create an RDD out of the rating file
>
> val rawTrainingRatings = sc.textFile(ratingFile).map {
>
>     line =>
>
>       val Array(userId, productId, scoreStr) = line.split(",")
>
>       AmazonRating(userId, productId, scoreStr.toDouble)
>
> }
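One side note on the mapper above: the `val Array(userId, productId, scoreStr) = line.split(",")` pattern throws a MatchError on any line that does not have exactly three fields, and `toDouble` throws on a header row. A defensive variant (a sketch, shown here on plain collections so it can be checked without Spark; with an RDD it would be `rawLines.flatMap(parseRating)`) would drop such lines instead of failing the task:

```scala
// Hypothetical defensive parser: same shape as the mapper above,
// but malformed lines are skipped instead of throwing.
case class AmazonRating(userId: String, productId: String, rating: Double)

def parseRating(line: String): Option[AmazonRating] =
  line.split(",") match {
    case Array(userId, productId, scoreStr) =>
      // Drops header rows and non-numeric scores instead of throwing
      scala.util.Try(scoreStr.toDouble).toOption
        .map(score => AmazonRating(userId, productId, score))
    case _ => None // wrong number of fields: blank line, extra commas, etc.
  }

val sample = List("u1,p1,5.0", "userId,productId,rating", "u2,p2,bad", "u3,p3,4.0")
val parsed = sample.flatMap(parseRating)
```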
>
>
>
>   // only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
>
> val trainingRatings = rawTrainingRatings.groupBy(_.userId).filter(r => MinRecommendationsPerUser <= r._2.size  && r._2.size < MaxRecommendationsPerUser).flatMap(_._2).repartition(NumPartitions).cache()
>
>
>
> println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")
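As a sanity check, the groupBy/filter step can be exercised on a plain collection, independent of HDFS and Spark, to confirm the bounds behave as intended (note the upper bound is exclusive). The names below mirror the ones above, with smaller bounds just for the demo:

```scala
case class AmazonRating(userId: String, productId: String, rating: Double)

val MinRecommendationsPerUser = 2 // demo bounds, smaller than the real ones
val MaxRecommendationsPerUser = 4

val ratings = List(
  AmazonRating("u1", "p1", 5.0),                                // u1: 1 rating  -> dropped
  AmazonRating("u2", "p1", 4.0), AmazonRating("u2", "p2", 3.0), // u2: 2 ratings -> kept
  AmazonRating("u3", "p1", 1.0), AmazonRating("u3", "p2", 2.0),
  AmazonRating("u3", "p3", 3.0), AmazonRating("u3", "p4", 4.0)  // u3: 4 ratings -> dropped (upper bound exclusive)
)

// Same logic as the RDD pipeline: group by user, keep users whose
// rating count is in [Min, Max), flatten back to individual ratings.
val kept = ratings.groupBy(_.userId)
  .filter { case (_, rs) =>
    MinRecommendationsPerUser <= rs.size && rs.size < MaxRecommendationsPerUser
  }
  .values.flatten.toList
```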
>
> I am getting this message:
>
> Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 ratings out of 568454
>
> because if i run the exact code within the spark-shell, i got this message:
>
> Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454
>
> *Why does it work fine within the spark-shell but not when run
> programmatically in the VMware image?*
>
> I run the code using the sbt-pack plugin to generate Unix commands and
> execute them within the VMware image, which hosts the Spark pseudo-cluster.
> This is the code I use to instantiate the SparkConf:
>
> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>
>                                    .setMaster("local[4]").set("spark.driver.allowMultipleContexts", "true")
>
>     val sc = new SparkContext(sparkConf)
>
>     val sqlContext = new SQLContext(sc)
>
>     val ssc = new StreamingContext(sparkConf, Seconds(2))
>
>     //this checkpointdir should be in a conf file, for now it is hardcoded!
>
>     val streamingCheckpointDir = "/home/cloudera/my-recommendation-spark-engine/checkpoint"
>
>     ssc.checkpoint(streamingCheckpointDir)
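One detail worth noting in the snippet above: `new StreamingContext(sparkConf, Seconds(2))` creates a second SparkContext from the conf, which is why `spark.driver.allowMultipleContexts` is needed at all. Building the StreamingContext from the existing context avoids that. A sketch, not tested against this codebase:

```scala
// Sketch: reuse the single SparkContext instead of allowing two.
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(2)) // reuses sc; no allowMultipleContexts needed
```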
>
> I have tried to set the Spark master this way, but an exception is
> raised; I suspect this is symptomatic of my problem:
> //.setMaster("spark://quickstart.cloudera:7077")
>
> The exception when I use the fully qualified domain name:
>
> .setMaster("spark://quickstart.cloudera:7077")
>
>
>
> java.io.IOException: Failed to connect to quickstart.cloudera/127.0.0.1:7077
>
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>
>         at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
>
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
>
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.net.ConnectException: Connection refused: quickstart.cloudera/127.0.0.1:7077
>
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>
>         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>
>         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>
> I can ping quickstart.cloudera in the Cloudera terminal, so why can't I
> use .setMaster("spark://quickstart.cloudera:7077") instead of
> .setMaster("local[*]")?
>
> [cloudera@quickstart bin]$ ping quickstart.cloudera
>
> PING quickstart.cloudera (127.0.0.1) 56(84) bytes of data.
>
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=1 ttl=64 time=0.019 ms
>
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=2 ttl=64 time=0.026 ms
>
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
>
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=4 ttl=64 time=0.028 ms
>
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=5 ttl=64 time=0.026 ms
>
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=6 ttl=64 time=0.020 ms
>
> And port 7077 is listening for incoming connections:
>
> [cloudera@quickstart bin]$ netstat -nap | grep 7077
>
> (Not all processes could be identified, non-owned process info
>
>  will not be shown, you would have to be root to see it all.)
>
> tcp        0      0 192.168.30.138:7077         0.0.0.0:*                   LISTEN
>
>
>
>
>
> [cloudera@quickstart bin]$ ping 192.168.30.138
>
> PING 192.168.30.138 (192.168.30.138) 56(84) bytes of data.
>
> 64 bytes from 192.168.30.138: icmp_seq=1 ttl=64 time=0.023 ms
>
> 64 bytes from 192.168.30.138: icmp_seq=2 ttl=64 time=0.026 ms
>
> 64 bytes from 192.168.30.138: icmp_seq=3 ttl=64 time=0.028 ms
>
> ^C
>
> --- 192.168.30.138 ping statistics ---
>
> 3 packets transmitted, 3 received, 0% packet loss, time 2810ms
>
> rtt min/avg/max/mdev = 0.023/0.025/0.028/0.006 ms
>
> [cloudera@quickstart bin]$ ifconfig
>
> eth2      Link encap:Ethernet  HWaddr 00:0C:29:6F:80:D2
>
>           inet addr:192.168.30.138  Bcast:192.168.30.255  Mask:255.255.255.0
>
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>
>           RX packets:8612 errors:0 dropped:0 overruns:0 frame:0
>
>           TX packets:8493 errors:0 dropped:0 overruns:0 carrier:0
>
>           collisions:0 txqueuelen:1000
>
>           RX bytes:2917515 (2.7 MiB)  TX bytes:849750 (829.8 KiB)
>
>
>
> lo        Link encap:Local Loopback
>
>           inet addr:127.0.0.1  Mask:255.0.0.0
>
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>
>           RX packets:57534 errors:0 dropped:0 overruns:0 frame:0
>
>           TX packets:57534 errors:0 dropped:0 overruns:0 carrier:0
>
>           collisions:0 txqueuelen:0
>
>           RX bytes:44440656 (42.3 MiB)  TX bytes:44440656 (42.3 MiB)
>
> I think this must be a misconfiguration in a Cloudera configuration
> file, but which one?
>
> Thank you very much for reading until here.
>
> *Alonso Isidoro Roman*
>
> about.me/alonso.isidoro.roman
>
>
> ------------------------------
>
> View this message in context: About a problem when mapping a file located
> within a HDFS vmware cdh-5.7 image
> <http://apache-spark-user-list.1001560.n3.nabble.com/About-a-problem-when-mapping-a-file-located-within-a-HDFS-vmware-cdh-5-7-image-tp27058.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
>
>