You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/01/07 04:07:00 UTC
[jira] [Resolved] (SPARK-26532) repartitionByRange reads source
files twice
[ https://issues.apache.org/jira/browse/SPARK-26532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-26532.
----------------------------------
Resolution: Not A Problem
> repartitionByRange reads source files twice
> -------------------------------------------
>
> Key: SPARK-26532
> URL: https://issues.apache.org/jira/browse/SPARK-26532
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Mike Dias
> Priority: Minor
> Attachments: repartition Stages.png, repartitionByRange Stages.png
>
>
> When using repartitionByRange in Structured Stream API for reading then write files, it reads the source files twice.
> Example:
> {code:java}
> val ds = spark.readStream.
> format("text").
> option("path", "data/streaming").
> load
> val q = ds.
> repartitionByRange(10, $"value").
> writeStream.
> format("parquet").
> option("path", "/tmp/output").
> option("checkpointLocation", "/tmp/checkpoint").
> start()
> {code}
> This execution creates 3 stages: 2 for reading and 1 for writing, reading the source twice. It's easy to see it in a large dataset where the reading process time is doubled.
>
> {code:java}
> $ curl -s -XGET http://localhost:4040/api/v1/applications/<shell_app_id>/stages
> {code}
>
>
> This is very different from the repartition strategy, which creates 2 stages: 1 for reading and 1 for writing.
> {code:java}
> val ds = spark.readStream.
> format("text").
> option("path", "data/streaming").
> load
> val q = ds.
> repartition(10, $"value").
> writeStream.
> format("parquet").
> option("path", "/tmp/output").
> option("checkpointLocation", "/tmp/checkpoint").
> start(){code}
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org