Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/07/05 02:53:00 UTC

[jira] [Commented] (SPARK-39605) PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS

    [ https://issues.apache.org/jira/browse/SPARK-39605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562330#comment-17562330 ] 

Hyukjin Kwon commented on SPARK-39605:
--------------------------------------

The exception comes from MongoDB, so I suspect this is a problem in that connector. It would be great to clarify where and what the issue is on the Apache Spark side.

> PySpark df.count() operation works fine on DBR 7.3 LTS but fails in DBR 10.4 LTS
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-39605
>                 URL: https://issues.apache.org/jira/browse/SPARK-39605
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Manoj Chandrashekar
>            Priority: Major
>         Attachments: image-2022-06-27-11-00-50-119.png
>
>
> I have a job that infers the schema from MongoDB and performs operations such as flattening and unwinding because there are nested fields. After the various transformations, the final count() works fine on Databricks Runtime 7.3 LTS but fails on 10.4 LTS. (A minimal sketch of this kind of pipeline follows this quoted description.)
> *Below is the image that shows a successful run on 7.3 LTS:*
> !https://docs.microsoft.com/answers/storage/attachments/215035-image.png|width=630,height=75!
> *Below is the image that shows the failure on 10.4 LTS:*
> !image-2022-06-27-11-00-50-119.png|width=624,height=64!
> And I have validated that no field in our schema has NullType. In fact, when the schema was inferred, there were null and void type fields, which were converted to string using my UDF. The issue persists even when I infer the schema on the complete dataset, that is, when samplePoolSize covers the full data set.
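
For context, here is a minimal, hedged PySpark sketch of the kind of pipeline described in the report above. It is not the reporter's actual code: the connector format name, the "connection.uri" option key, and the flatten_one_level helper are illustrative assumptions, and the exact names depend on the MongoDB Spark connector version in use.

{code:python}
# Minimal, illustrative sketch of the pipeline described above (assumptions noted).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import NullType, StringType

spark = SparkSession.builder.getOrCreate()

# Read from MongoDB and let the connector infer the schema.
# Format name and option key are assumptions; older connectors use "mongo".
df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://host:27017/db.collection")
      .load())

# Cast any columns whose inferred type is NullType to string, mirroring the
# null/void-to-string conversion mentioned in the report.
for field in df.schema.fields:
    if isinstance(field.dataType, NullType):
        df = df.withColumn(field.name, F.col(field.name).cast(StringType()))

# Flatten one level of nested struct fields; array fields would be unwound
# with explode() in the same pass.
def flatten_one_level(frame):
    cols = []
    for field in frame.schema.fields:
        if field.dataType.typeName() == "struct":
            for sub in field.dataType.fields:
                cols.append(
                    F.col(f"{field.name}.{sub.name}").alias(f"{field.name}_{sub.name}"))
        else:
            cols.append(F.col(field.name))
    return frame.select(cols)

df = flatten_one_level(df)

# The action that reportedly succeeds on DBR 7.3 LTS but fails on 10.4 LTS.
print(df.count())
{code}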


