Posted to user@spark.apache.org by Sid <fl...@gmail.com> on 2022/05/06 08:25:22 UTC

Count() action leading to errors | PySpark

Hi Team,

I am trying to display the count of a DataFrame that is created by running a
Spark SQL query with a CTE pattern.

Everything works as expected: I was able to write the DF to Postgres RDS.
However, when I try to display the count using a simple count() action, it
fails with the error below:

py4j.protocol.Py4JJavaError: An error occurred while calling o321.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
1301 in stage 35.0 failed 4 times, most recent failure: Lost task 1301.3 in
stage 35.0 (TID 7889, 10.100.6.148, executor 1):
java.io.FileNotFoundException: File not present on S3
It is possible the underlying files have been updated. You can explicitly
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
in SQL or by recreating the Dataset/DataFrame involved.
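
For reference, the refresh that the error suggests would look something like
the below (the table name and query variable are just placeholders for
whatever the CTE query actually reads):

spark.sql("REFRESH TABLE my_table")       # drop Spark's cached file listing
spark.catalog.refreshTable("my_table")    # the same thing via the catalog API
modifiedData = spark.sql(cte_query)       # or recreate the DF so S3 is re-listed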


So, I tried something like this:

print(modifiedData.repartition(modifiedData.rdd.getNumPartitions()).count())

There are 80 partitions formed for this DF, and the count written to the
table is 92,665. However, it does not match the count displayed after
repartitioning, which was 91,183.

I am not sure why there is this gap.

Why do the counts not match? Also, what could be the possible reason for
that simple count() error?
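
One thing I have not tried yet is persisting the DF before both actions, so
the write and the count run against the same materialized data instead of
recomputing the query from S3 each time. A rough sketch (the JDBC details
below are placeholders):

modifiedData.cache()                      # pin the result of the CTE query
modifiedData.write.jdbc(
    url=jdbc_url,                         # placeholder connection string
    table="target_table",                 # placeholder table name
    mode="append",
    properties=jdbc_props,                # placeholder credentials dict
)
print(modifiedData.count())               # counts the cached data, not a recompute
modifiedData.unpersist()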

Environment:
AWS Glue 1.x
10 workers
Spark 2.4.3

Thanks,
Sid

Re: Count() action leading to errors | PySpark

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Try it without the CTE.

A SQL CTE is temporary, so you are probably working on two different datasets.
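
If you need the CTE result in two actions, materialize it once so the write
and the count read the same data. A rough sketch (the S3 path and query
variable are placeholders):

result = spark.sql(cte_query)             # cte_query: the original CTE query
result.write.mode("overwrite").parquet("s3://bucket/tmp/cte_result")
materialized = spark.read.parquet("s3://bucket/tmp/cte_result")
print(materialized.count())               # count the materialized copy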

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297