Posted to issues@spark.apache.org by "Bruce Robbins (Jira)" <ji...@apache.org> on 2023/10/07 00:03:00 UTC

[jira] [Commented] (SPARK-45440) Incorrect summary counts from a CSV file

    [ https://issues.apache.org/jira/browse/SPARK-45440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772724#comment-17772724 ] 

Bruce Robbins commented on SPARK-45440:
---------------------------------------

I added {{inferSchema=true}} as a datasource option in your example and got the expected answer. Without it, every CSV column is read as a string, so max and min compare the values as strings (lexicographically) rather than as numbers.
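
To illustrate why a min and max over strings differ from a min and max over numbers, here is a minimal sketch in plain Python (no Spark required). The sample rows and the column name vwap are made up for illustration and are not taken from the AAPL file; the float conversion merely stands in for what schema inference would do.

```python
import csv
import io

# Made-up sample data for illustration only; not taken from the AAPL file.
SAMPLE_CSV = "date,vwap\n2023-01-03,9.5\n2023-01-04,10.25\n2023-01-05,100.0\n"


def min_max(values):
    """Return (min, max) of the values using their natural ordering."""
    values = list(values)
    return min(values), max(values)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# csv.DictReader yields every field as a string, much like a CSV read
# without schema inference.
as_strings = [row["vwap"] for row in rows]

# Explicit conversion stands in for what schema inference would do.
as_numbers = [float(value) for value in as_strings]

string_min, string_max = min_max(as_strings)
number_min, number_max = min_max(as_numbers)

print("string min/max: ", string_min, string_max)   # 10.25 9.5
print("numeric min/max:", number_min, number_max)   # 9.5 100.0
```

Compared as strings, "9.5" sorts after "100.0" because the comparison goes character by character, so the string max is "9.5" even though it is the smallest number. In PySpark the corresponding change would be along the lines of spark.read.csv(path, header=True, inferSchema=True), where path is a placeholder for the CSV location.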

> Incorrect summary counts from a CSV file
> ----------------------------------------
>
>                 Key: SPARK-45440
>                 URL: https://issues.apache.org/jira/browse/SPARK-45440
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.5.0
>         Environment: Pyspark version 3.5.0 
>            Reporter: Evan Volgas
>            Priority: Major
>              Labels: aggregation, bug, pyspark
>
> I am using pip-installed PySpark version 3.5.0 inside an IPython shell. The task is straightforward: take [this CSV file|https://gist.githubusercontent.com/evanvolgas/e5cb082673ec947239658291f2251de4/raw/a9c5e9866ac662a816f9f3828a2d184032f604f0/AAPL.csv] of AAPL stock prices and compute the minimum and maximum volume-weighted average price for the entire file.
> My code is [here|https://gist.github.com/evanvolgas/e4aa75fec4179bb7075a5283867f127c]. I've also performed the same computation in DuckDB because I noticed that the results of the Spark code are wrong.
> Literally, the exact same SQL yields different results in DuckDB and in Spark, and Spark's are wrong.
> I have never seen this behavior in a Spark release before. I'm very confused by it, and curious if anyone else can replicate this behavior. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
