Posted to issues@spark.apache.org by "ABHISHEK CHOUDHARY (JIRA)" <ji...@apache.org> on 2015/09/03 21:16:45 UTC
[jira] [Commented] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729618#comment-14729618 ]
ABHISHEK CHOUDHARY commented on SPARK-10189:
--------------------------------------------
Well, the problem was actually with the Java version.
pyspark raises the socket connection problem when running on Java 1.8.
I tried with Java 1.7 and it works fine.
> python rdd socket connection problem
> ------------------------------------
>
> Key: SPARK-10189
> URL: https://issues.apache.org/jira/browse/SPARK-10189
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.4.1
> Reporter: ABHISHEK CHOUDHARY
> Labels: pyspark, socket
>
> I am trying to use wholeTextFiles with pyspark, and now I am getting the same error -
> {code}
> textFiles = sc.wholeTextFiles('/file/content')
> textFiles.take(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
>     return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
>     raise Exception("could not open socket")
> Exception: could not open socket
> >>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
> java.net.SocketTimeoutException: Accept timed out
>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>     at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
>     at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>     at java.net.ServerSocket.accept(ServerSocket.java:513)
>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
> {code}
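For reference, the JVM-side error above (SocketTimeoutException: Accept timed out) is what a listening socket reports when its accept() has a deadline and no client ever connects. A minimal Python sketch of the same behaviour (illustrative only, not Spark code):

```python
import socket

# A listening socket with an accept deadline; nothing ever connects,
# so accept() times out -- the Python analogue of the JVM's
# java.net.SocketTimeoutException: Accept timed out.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)
server.settimeout(0.2)  # short deadline for the demo

timed_out = False
try:
    server.accept()
except socket.timeout:
    timed_out = True
finally:
    server.close()

print(timed_out)  # True
```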
> Current piece of code in rdd.py-
> {code:title=rdd.py|borderStyle=solid}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         try:
>             sock = socket.socket(af, socktype, proto)
>             sock.settimeout(3)
>             sock.connect(sa)
>         except socket.error:
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     try:
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
> {code}
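The loop above tries every address that "localhost" resolves to (typically ::1 before 127.0.0.1) and only raises "could not open socket" once all of them fail. A standalone sketch of that pattern (not the Spark implementation; the helper name and demo server are illustrative):

```python
import socket
import threading

def load_from_port(port, timeout=3.0):
    """Connect to localhost:port, trying each resolved address
    in turn, mirroring the retry loop in _load_from_socket."""
    sock = None
    for af, socktype, proto, _canonname, sa in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue  # this address failed; try the next one
        break
    if not sock:
        raise Exception("could not open socket")
    return sock

# Demo: a throwaway server that accepts one connection and sends a byte.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # IPv4 only, so a ::1 attempt is refused first
server.listen(1)
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()
    conn.sendall(b"x")
    conn.close()

t = threading.Thread(target=serve)
t.start()
client = load_from_port(port)
data = client.makefile("rb").read(1)
client.close()
t.join()
server.close()
print(data)  # b'x'
```

Because the demo server is bound to IPv4 only, a system that resolves ::1 first exercises the fallback path: the first connect is refused and the loop moves on to 127.0.0.1.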
> On further investigating the issue, I realized that in context.py, runJob does not actually trigger the server, so there is nothing to connect to -
> {code:title=context.py|borderStyle=solid}
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org