
Posted to issues@spark.apache.org by "Mike Dias (JIRA)" <ji...@apache.org> on 2019/01/04 05:24:00 UTC

[jira] [Created] (SPARK-26532) repartitionByRange strategy reads source files twice

Mike Dias created SPARK-26532:
---------------------------------

             Summary: repartitionByRange strategy reads source files twice
                 Key: SPARK-26532
                 URL: https://issues.apache.org/jira/browse/SPARK-26532
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.0, 2.3.2
            Reporter: Mike Dias
         Attachments: repartition Stages.png, repartitionByRange Stages.png

When repartitionByRange is used in the Structured Streaming API to read files and then write them back out, the source files are read twice.

Example:
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load

val q = ds.
  repartitionByRange(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
This execution creates 3 stages: 2 for reading and 1 for writing, so the source is read twice. The effect is easy to see on a large dataset, where the read time doubles. The stages can be listed through the REST API:
{code:java}
$ curl -s -XGET http://localhost:4040/api/v1/applications/<shell_app_id>/stages
{code}
This is very different from the repartition strategy, which creates 2 stages: 1 for reading and 1 for writing.
{code:java}
val ds = spark.readStream.
  format("text").
  option("path", "data/streaming").
  load

val q = ds.
  repartition(10, $"value").
  writeStream.
  format("parquet").
  option("path", "/tmp/output").
  option("checkpointLocation", "/tmp/checkpoint").
  start()
{code}
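
A possible explanation, not confirmed in this report: hash partitioning can route each row using only the row itself, while range partitioning must first learn the partition boundaries from the data. Spark's RangePartitioner derives its bounds by sampling its input, which may be the extra read stage. The sketch below is a plain Scala toy, not Spark code. CountingSource, hashPartition and rangePartition are hypothetical names used only for illustration, and the toy sorts the full input where a real system would sample.

```scala
// Toy model (not Spark code): a source that counts how often it is scanned.
final class CountingSource(rows: Seq[String]) {
  var scans = 0
  def scan(): Iterator[String] = { scans += 1; rows.iterator }
}

object PartitioningSketch {
  // Hash partitioning: the target partition depends only on the row,
  // so a single pass over the source is enough.
  def hashPartition(src: CountingSource, n: Int): Map[Int, Vector[String]] =
    src.scan().toVector.groupBy(r => java.lang.Math.floorMod(r.hashCode, n))

  // Range partitioning: the boundaries come from the data, so the source
  // is scanned once to compute them and once more to route the rows.
  def rangePartition(src: CountingSource, n: Int): Map[Int, Vector[String]] = {
    // Pass 1: derive n - 1 boundaries (assumption: full sort instead of sampling).
    val sorted = src.scan().toVector.sorted
    val bounds =
      if (sorted.isEmpty) Vector.empty[String]
      else (1 until n).map(i => sorted(i * sorted.size / n)).toVector
    // Pass 2: a row goes to the partition indexed by the number of bounds <= row.
    src.scan().toVector.groupBy(r => bounds.count(_ <= r))
  }
}
```

In this toy, one call to hashPartition leaves scans at 1 and one call to rangePartition leaves it at 2, which mirrors the 2-stage versus 3-stage difference described above.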
 


