You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Dmitry (Jira)" <ji...@apache.org> on 2022/06/27 21:12:00 UTC
[jira] [Created] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?

|  ![](cid:jira-generated-image-avatar-382cb30a-69ae-479c-8934-df40eb9005dd) |
[Dmitry](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=Litvintsev)
**created** an issue  
---|---  
|  
---  
|  [Spark](https://issues.apache.org/jira/browse/SPARK) / [![Bug](cid:jira-
generated-image-
avatar-05350c7b-ea88-4b06-9b30-bb20d0cb369d)](https://issues.apache.org/jira/browse/SPARK-39623)
[SPARK-39623](https://issues.apache.org/jira/browse/SPARK-39623)  
---  
[partitionng by datestamp leads to wrong query on
backend?](https://issues.apache.org/jira/browse/SPARK-39623)  
| Issue Type: |  ![Bug](cid:jira-generated-image-
avatar-05350c7b-ea88-4b06-9b30-bb20d0cb369d) Bug  
---|---  
Affects Versions: |  3.3.0  
Assignee: |  Unassigned  
Components: |  SQL  
Created: |  27/Jun/22 21:11  
Priority: |  ![Major](cid:jira-generated-image-static-
major-14ec3380-ed27-4abf-ad19-dc7ebe46b569) Major  
Reporter: |
[Dmitry](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=Litvintsev)  
|

Hello,

I am new to Apache spark, so please bear with me. I would like to report what
seems to me a bug, but may be I am just not understanding something.

My goal is to run data analysis on a spark cluster. Data is stored in
PostgreSQL DB. Tables contained timestamped entries (timestamp with time
zone).

The code look like:



    
    
    
    from pyspark.sql import SparkSession
    
    spark = SparkSession \
            .builder \
            .appName("foo") \
            .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
            .getOrCreate()
    
    df = spark.read \
         .format("jdbc") \
         .option("url", "jdbc:postgresql://example.org:5432/postgres") \
         .option("dbtable", "billing") \
         .option("user", "user") \
         .option("driver", "org.postgresql.Driver") \
         .option("numPartitions", "4") \
         .option("partitionColumn", "datestamp") \
         .option("lowerBound", "2022-01-01 00:00:00") \
         .option("upperBound", "2022-06-26 23:59:59") \
         .option("fetchsize", 1000000) \
         .load()
    
    t0 = time.time()
    print("Number of entries is =====> ", df.count(), " Time to execute ", time.time()-t0)
    ...
    

datestamp is timestamp with time zone.

I see this query on DB backend:

```  
SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or
"datestamp" is null  
```  
The table is huge and entries go way back before  
 2022-01-02 11:59:59. So what ends up happening - all workers but one complete
and one remaining continues to process that query which, to me, looks like it
wants to get all the data before 2022-01-02 11:59:59. Which is not what I
intended.

I remedies this by changing to:

    
    
         .option("dbtable", "(select * from billinginfo where datestamp > '2022 01-01-01 00:00:00') as foo") \
    

And that seem to have solved the issue. But this seems kludgy. Am I doing
something wrong or there is a bug in the way partitioning queries are
generated?  
  
---  
|  |  [ ![Add Comment](cid:jira-generated-image-static-comment-
icon-602a1ccf-159f-451f-954b-8e7fa4474553)
](https://issues.apache.org/jira/browse/SPARK-39623#add-comment "Add Comment")
|  [Add Comment](https://issues.apache.org/jira/browse/SPARK-39623#add-comment
"Add Comment")  
---|---  
  
|  This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9) |  |
![Atlassian logo](https://issues.apache.org/jira/images/mail/atlassian-email-
logo.png)  
---