Posted to issues@spark.apache.org by "Frank Rosner (JIRA)" <ji...@apache.org> on 2017/10/17 09:27:00 UTC

[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

    [ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207220#comment-16207220 ] 

Frank Rosner commented on SPARK-21551:
--------------------------------------

Do you guys mind if I backport this also to 2.0.x, 2.1.x, and 2.2.x? We have some jobs that we don't want to upgrade to 2.3.0 but that fail regularly because of this problem.

Which branches would that have to go to? branch-2.0, branch-2.1, and branch-2.2?

> pyspark's collect fails when getaddrinfo is too slow
> ----------------------------------------------------
>
>                 Key: SPARK-21551
>                 URL: https://issues.apache.org/jira/browse/SPARK-21551
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: peay
>            Assignee: peay
>            Priority: Critical
>             Fix For: 2.3.0
>
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and {{DataFrame.collect}} all work by starting an ephemeral server in the driver, and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
> {code}
> The server has *a hardcoded timeout of 3 seconds* (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) -- i.e., the Python process has 3 seconds to connect to it from the very moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts leading to {{Exception: could not open socket}}.
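[Editor's note: the driver-side behavior described above can be re-stated as a small standalone Python sketch. This is not the actual Scala code from PythonRDD.scala; the variable names are illustrative.]

```python
import socket

# Toy re-statement of the driver-side server (the real implementation is
# Scala, in PythonRDD.scala): listen on an ephemeral port, then wait for a
# single connection from the Python process within a 3-second window.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 0))   # port 0 = ephemeral port, as the driver does
server.listen(1)
server.settimeout(3)            # the hardcoded window the client must hit
port = server.getsockname()[1]  # analogous to what collectToPython returns
print(port)
# server.accept() would now raise socket.timeout if the Python side
# (getaddrinfo + connect) took longer than 3 seconds.
server.close()
```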
> After investigating a bit, it turns out that {{_load_from_socket}} makes a call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>        .. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine take only a couple of milliseconds, about 10% of them take between 2 and 10 seconds, leading to about 10% of jobs failing. I don't think we can always expect {{getaddrinfo}} to return instantly. More generally, Python may sometimes pause for a couple of seconds, which may not leave enough time for the process to connect to the server.
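[Editor's note: the resolver latency claim is easy to check in isolation. The following standalone sketch (not Spark code; the sample count and port number are arbitrary) times repeated {{getaddrinfo}} calls issued the same way {{_load_from_socket}} issues them.]

```python
import socket
import time

def time_getaddrinfo(samples=5, port=15002):
    # Time repeated resolutions of "localhost", mirroring the call in
    # _load_from_socket. The port value is arbitrary: it is only copied
    # into the result tuples and does not affect resolution latency.
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        timings.append(time.monotonic() - start)
    return timings

timings = time_getaddrinfo()
print("worst case: %.3fs" % max(timings))
```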
> Especially since the server timeout is hardcoded, I think it would be best to set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}-specific fix could avoid calling it every single time, or call it before starting up the driver server.
>  
> cc SPARK-677 [~davies]
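[Editor's note: one way to sketch the client side of such a fix is to give the whole resolve-and-connect sequence a single generous deadline and try every address {{getaddrinfo}} returns. This is hypothetical code, not what was merged; the function name and the 15-second default are illustrative.]

```python
import socket
import time

def connect_with_deadline(port, deadline=15.0):
    # Hypothetical sketch: instead of relying on getaddrinfo plus connect
    # finishing within 3 seconds, spend up to `deadline` seconds overall
    # (which would pair with an equally generous server-side timeout).
    end = time.monotonic() + deadline
    last_err = None
    for af, socktype, proto, _, sa in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        remaining = end - time.monotonic()
        if remaining <= 0:
            break
        sock = socket.socket(af, socktype, proto)
        sock.settimeout(remaining)
        try:
            sock.connect(sa)
            return sock
        except socket.error as err:
            last_err = err
            sock.close()
    raise Exception("could not open socket: %s" % last_err)
```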



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org