You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Mike Dias (JIRA)" <ji...@apache.org> on 2019/01/04 05:24:00 UTC
[jira] [Created] (SPARK-26532) repartitionByRange strategy reads
source files twice
Mike Dias created SPARK-26532:
---------------------------------
Summary: repartitionByRange strategy reads source files twice
Key: SPARK-26532
URL: https://issues.apache.org/jira/browse/SPARK-26532
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.4.0, 2.3.2
Reporter: Mike Dias
Attachments: repartition Stages.png, repartitionByRange Stages.png
When using repartitionByRange in Structured Stream API for reading then write files, it reads the source files twice.
Example:
{code:java}
val ds = spark.readStream.
format("text").
option("path", "data/streaming").
load
val q = ds.
repartitionByRange(10, $"value").
writeStream.
format("parquet").
option("path", "/tmp/output").
option("checkpointLocation", "/tmp/checkpoint").
start()
{code}
This execution creates 3 stages: 2 for reading and 1 for writing, reading the source twice. It's easy to see it in a large dataset where the reading process time is doubled.
{code:java}
$ curl -s -XGET http://localhost:4040/api/v1/applications/<shell_app_id>/stages
{code}
This is very different from the repartition strategy, which creates 2 stages: 1 for reading and 1 for writing.
{code:java}
val ds = spark.readStream.
format("text").
option("path", "data/streaming").
load
val q = ds.
repartition(10, $"value").
writeStream.
format("parquet").
option("path", "/tmp/output").
option("checkpointLocation", "/tmp/checkpoint").
start(){code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org