You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Dmitry (Jira)" <ji...@apache.org> on 2022/06/27 21:12:00 UTC
[jira] [Created] (SPARK-39623) partitionng by datestamp leads to wrong query on backend?
| ![](cid:jira-generated-image-avatar-382cb30a-69ae-479c-8934-df40eb9005dd) |
[Dmitry](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=Litvintsev)
**created** an issue
---|---
|
---
| [Spark](https://issues.apache.org/jira/browse/SPARK) / [![Bug](cid:jira-
generated-image-
avatar-05350c7b-ea88-4b06-9b30-bb20d0cb369d)](https://issues.apache.org/jira/browse/SPARK-39623)
[SPARK-39623](https://issues.apache.org/jira/browse/SPARK-39623)
---
[partitionng by datestamp leads to wrong query on
backend?](https://issues.apache.org/jira/browse/SPARK-39623)
| Issue Type: | ![Bug](cid:jira-generated-image-
avatar-05350c7b-ea88-4b06-9b30-bb20d0cb369d) Bug
---|---
Affects Versions: | 3.3.0
Assignee: | Unassigned
Components: | SQL
Created: | 27/Jun/22 21:11
Priority: | ![Major](cid:jira-generated-image-static-
major-14ec3380-ed27-4abf-ad19-dc7ebe46b569) Major
Reporter: |
[Dmitry](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=Litvintsev)
|
Hello,
I am new to Apache spark, so please bear with me. I would like to report what
seems to me a bug, but may be I am just not understanding something.
My goal is to run data analysis on a spark cluster. Data is stored in
PostgreSQL DB. Tables contained timestamped entries (timestamp with time
zone).
The code look like:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("foo") \
.config("spark.jars", "/opt/postgresql-42.4.0.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://example.org:5432/postgres") \
.option("dbtable", "billing") \
.option("user", "user") \
.option("driver", "org.postgresql.Driver") \
.option("numPartitions", "4") \
.option("partitionColumn", "datestamp") \
.option("lowerBound", "2022-01-01 00:00:00") \
.option("upperBound", "2022-06-26 23:59:59") \
.option("fetchsize", 1000000) \
.load()
t0 = time.time()
print("Number of entries is =====> ", df.count(), " Time to execute ", time.time()-t0)
...
datestamp is timestamp with time zone.
I see this query on DB backend:
```
SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or
"datestamp" is null
```
The table is huge and entries go way back before
2022-01-02 11:59:59. So what ends up happening - all workers but one complete
and one remaining continues to process that query which, to me, looks like it
wants to get all the data before 2022-01-02 11:59:59. Which is not what I
intended.
I remedies this by changing to:
.option("dbtable", "(select * from billinginfo where datestamp > '2022 01-01-01 00:00:00') as foo") \
And that seem to have solved the issue. But this seems kludgy. Am I doing
something wrong or there is a bug in the way partitioning queries are
generated?
---
| | [ ![Add Comment](cid:jira-generated-image-static-comment-
icon-602a1ccf-159f-451f-954b-8e7fa4474553)
](https://issues.apache.org/jira/browse/SPARK-39623#add-comment "Add Comment")
| [Add Comment](https://issues.apache.org/jira/browse/SPARK-39623#add-comment
"Add Comment")
---|---
| This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9) | |
![Atlassian logo](https://issues.apache.org/jira/images/mail/atlassian-email-
logo.png)
---