Posted to issues@spark.apache.org by "ABHISHEK CHOUDHARY (JIRA)" <ji...@apache.org> on 2015/09/03 21:16:45 UTC
[jira] [Commented] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729618#comment-14729618 ]
ABHISHEK CHOUDHARY commented on SPARK-10189:
--------------------------------------------
Well, the problem was actually with the Java version.
pyspark raises the socket connection problem when running on Java 1.8.
I tried with Java 1.7 and it works fine.
> python rdd socket connection problem
> ------------------------------------
>
> Key: SPARK-10189
> URL: https://issues.apache.org/jira/browse/SPARK-10189
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.4.1
> Reporter: ABHISHEK CHOUDHARY
> Labels: pyspark, socket
>
> I am trying to use wholeTextFiles with pyspark, and now I am getting the same error -
> {code}
> textFiles = sc.wholeTextFiles('/file/content')
> textFiles.take(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
>     return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
>     raise Exception("could not open socket")
> Exception: could not open socket
> >>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
> java.net.SocketTimeoutException: Accept timed out
>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>     at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
>     at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>     at java.net.ServerSocket.accept(ServerSocket.java:513)
>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
> {code}
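For reference, the JVM-side error above (SocketTimeoutException: Accept timed out) is what a listening socket reports when its accept() has a deadline and no client ever connects. A minimal Python sketch of the same behaviour (illustrative only, not Spark code):

```python
import socket

# A listening socket with an accept deadline; nothing ever connects,
# so accept() times out -- the Python analogue of the JVM's
# java.net.SocketTimeoutException: Accept timed out.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)
server.settimeout(0.2)  # short deadline for the demo

timed_out = False
try:
    server.accept()
except socket.timeout:
    timed_out = True
finally:
    server.close()

print(timed_out)  # True
```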
> Current piece of code in rdd.py-
> {code:title=rdd.py|borderStyle=solid}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         try:
>             sock = socket.socket(af, socktype, proto)
>             sock.settimeout(3)
>             sock.connect(sa)
>         except socket.error:
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     try:
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
> {code}
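The loop above tries every address that "localhost" resolves to (typically ::1 before 127.0.0.1) and only raises "could not open socket" once all of them fail. A standalone sketch of that pattern (not the Spark implementation; the helper name and demo server are illustrative):

```python
import socket
import threading

def load_from_port(port, timeout=3.0):
    """Connect to localhost:port, trying each resolved address
    in turn, mirroring the retry loop in _load_from_socket."""
    sock = None
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue  # this address failed; try the next one
        break
    if not sock:
        raise Exception("could not open socket")
    return sock

# Demo: a throwaway server that accepts one connection and sends a byte.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # IPv4 only, so a ::1 attempt is refused first
server.listen(1)
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()
    conn.sendall(b"x")
    conn.close()

t = threading.Thread(target=serve)
t.start()
client = load_from_port(port)
data = client.makefile("rb").read(1)
client.close()
t.join()
server.close()
print(data)  # b'x'
```

Because the demo server is bound to IPv4 only, a system that resolves ::1 first exercises the fallback path: the first connect is refused and the loop moves on to 127.0.0.1.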
> On further investigating the issue, I realized that in context.py, runJob does not actually trigger the server, so there is nothing to connect to -
> {code:title=context.py|borderStyle=solid}
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org