Posted to issues@spark.apache.org by "Alex Veale (Jira)" <ji...@apache.org> on 2022/04/11 09:56:00 UTC

[jira] [Updated] (SPARK-38858) PythonException - socket.timeout: timed out - socket.py line 707

     [ https://issues.apache.org/jira/browse/SPARK-38858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Veale updated SPARK-38858:
-------------------------------
    Attachment: socketpy.png
                socketError - timed out.png

> PythonException - socket.timeout: timed out - socket.py line 707
> ----------------------------------------------------------------
>
>                 Key: SPARK-38858
>                 URL: https://issues.apache.org/jira/browse/SPARK-38858
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.2.1
>            Environment: Intel Core i7
> 64 GB RAM (30 GB assigned to Spark executor memory)
> 4 cores
> Windows 11
>            Reporter: Alex Veale
>            Priority: Major
>              Labels: test
>         Attachments: socketError - timed out.png, socketpy.png
>
>
> I have a database of about 8 million residential addresses. I perform three separate cleaning operations on the data using UDFs and regular expressions (the Python re package). I then create an additional column by splitting the 'cleaned' address on commas and taking the last element as the suburb, and use that column as the key to join the original DataFrame to a supplementary one containing suburb/country pairs. Finally, I create another column holding the final address: the 'unsplit clean' address concatenated with the country column pulled in by the join. (A sketch of this pipeline appears after this description.)
> When I try to display the result by calling show, I get the desired output if I show only the first 1000 records or fewer; however, if I try to show more records, or add a filter to display only the records that were modified, I get a socket timeout error.
> I have tried increasing the socket's send and receive buffer sizes to the maximum of 1048576 bytes, increasing the Spark executor heartbeat interval (7200 s) and the Spark network timeout (3600 s), and repartitioning the data into 16 and 32 partitions; none of these changes has had any impact on the result.
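>
> A minimal sketch of the pipeline described above, for reproduction purposes only: the column name 'address', the two input sources, and the single stand-in cleaning UDF are hypothetical, and the three real regex cleaning passes are collapsed into one.
>
> import re
>
> from pyspark.sql import SparkSession, functions as F
> from pyspark.sql.types import StringType
>
> spark = SparkSession.builder.getOrCreate()
>
> WS_RE = re.compile(r"\s{2,}")
>
> @F.udf(StringType())
> def clean_address(addr):
>     # Stand-in for the three regex-based cleaning operations.
>     return None if addr is None else WS_RE.sub(" ", addr).strip()
>
> addresses = spark.read.parquet("addresses")            # ~8M rows, assumed source
> suburb_country = spark.read.parquet("suburb_country")  # columns: suburb, country
>
> cleaned = addresses.withColumn("clean_address", clean_address("address"))
>
> # The last comma-separated token of the cleaned address becomes the suburb key.
> with_suburb = cleaned.withColumn(
>     "suburb", F.trim(F.element_at(F.split("clean_address", ","), -1)))
>
> # Join on suburb, then append the country to the unsplit cleaned address.
> result = (with_suburb.join(suburb_country, "suburb", "left")
>           .withColumn("final_address",
>                       F.concat_ws(", ", "clean_address", "country")))
>
> result.show(1000, truncate=False)   # succeeds; larger counts hit the timeout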
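>
> As a side note, Spark expects spark.network.timeout to be larger than spark.executor.heartbeatInterval, so the combination quoted above (heartbeat 7200 s, network timeout 3600 s) inverts that ordering. Below is a sketch of the session settings with the ordering restored; the application name and input path are hypothetical.
>
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .appName("address-cleaning")                       # hypothetical name
>          .config("spark.executor.memory", "30g")
>          .config("spark.executor.heartbeatInterval", "3600s")
>          .config("spark.network.timeout", "7200s")          # must exceed heartbeat
>          .getOrCreate())
>
> addresses = spark.read.parquet("addresses").repartition(32)  # assumed source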



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org