You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/23 08:12:04 UTC

[GitHub] [spark] HeartSaVioR opened a new pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

HeartSaVioR opened a new pull request #28904:
URL: https://github.com/apache/spark/pull/28904


   ### What changes were proposed in this pull request?
   
   In many operations on CompactibleFileStreamLog reads a metadata log file and materializes all entries into memory. As the nature of the compact operation, CompactibleFileStreamLog may have a huge compact log file with bunch of entries included, and for now they're just monotonically increasing, which means the amount of memory to materialize also grows incrementally. This leads pressure on GC.
   
   This patch proposes to streamline the logic on file stream source and sink whenever possible to avoid memory issue. To make this possible we have to break the existing behavior of excluding entries - now the `compactLogs` method is called with all entries, which forces us to materialize all entries into memory. This is hopefully no effect on end users, because only file stream sink has a condition to exclude entries, and the condition has been never true. (DELETE_ACTION has been never set.)
   
   Based on the observation, this patch also changes the existing UT a bit which simulates the situation where "A" file is added, and another batch marks the "A" file as deleted. This situation simply doesn't work with the change, but as I mentioned earlier it hasn't been used. (I'm not sure the UT is from the actual run. I guess not.)
   
   ### Why are the changes needed?
   
   The memory issue (OOME) is reported by both JIRA issue and user mailing list.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   * Existing UTs
   * Manual test done
   
   The manual test leverages the simple apps which continuously writes the file stream sink metadata log.
   
   https://github.com/HeartSaVioR/spark-delegation-token-experiment/commit/bea7680e4c588f455f8c3181a96c9eff5002fa1a
   
   The test is configured to have a batch metadata log file at 1.9M (10,000 entries) whereas other Spark configuration is set to the default. (compact interval = 10) The app runs as driver, and the heap memory on driver is set to 3g.
   
   > before the patch
   
   <img width="1094" alt="Screen Shot 2020-06-23 at 3 37 44 PM" src="https://user-images.githubusercontent.com/1317309/85375841-d94f3480-b571-11ea-817b-c6b48b34888a.png">
   
   It only ran for 40 mins, with the latest compact batch file size as 1.3G. The process struggled with GC, and after some struggling, it threw OOME.
   
   > after the patch
   
   <img width="1094" alt="Screen Shot 2020-06-23 at 3 53 29 PM" src="https://user-images.githubusercontent.com/1317309/85375901-eff58b80-b571-11ea-837e-30d107f677f9.png">
   
   It sustained 2 hours run, with the latest compact batch file size as 2.2G. The actual memory usage didn't even go up to 1.2G, and be cleaned up soon without outstanding GC activity.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659210920






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666770668


   > for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
   
   I already provided the change (Spark 3.0.0 if I remember correctly), but this is only "best-effort". You cannot deal with crashing scenario.
   
   Given we directly write the file to the final path, worth noting that reading output files in directory directly doesn't only mean you're reading duplicated outputs (at least once). This also means there's a chance you may be reading incomplete/corrupted files as well, in some sort of crashing during write.
   
   Stepping back to explain the rationalization and the goal - providing holistic solution is not a goal. 
   
   There're already lots of efforts being made to provide holistic solution, though these efforts are happening "outside" of Spark codebase, Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just reinvent the wheel if I try to address the entire problems, which I can't persuade anyone to provide me enough time to work.
   (Please refer the comment which is the feedback I got - https://github.com/apache/spark/pull/27694#issuecomment-651454246)
   
   My goal of the overall improvements in file stream source/sink is, enabling end users run the query a bit longer without weird issues (like OOM). I just had to take easier (and limited) approach to solve the issue. e.g. regarding growing entries issue, while alternatives support data compaction to reduce the overall number of files without losing anything, I proposed retention of the output files so that older files can be excluded in metadata log but they no longer be accessible. End users need to pick up alternatives once they cannot live with the limitations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652825181


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124870/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654325185


   **[Test build #125090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125090/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648122404


   **[Test build #124403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124403/testReport)** for PR 28904 at commit [`b00e262`](https://github.com/apache/spark/commit/b00e2621b448fc810447b7c2ed9c99d8e60c96e7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656391366






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656408471


   **[Test build #125521 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125521/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-657000679


   **[Test build #125647 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125647/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647986027


   **[Test build #124403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124403/testReport)** for PR 28904 at commit [`b00e262`](https://github.com/apache/spark/commit/b00e2621b448fc810447b7c2ed9c99d8e60c96e7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654468523


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125127/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658519932






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652972481


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124877/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655252709


   **[Test build #125249 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125249/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-671108824






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666770668


   > for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
   
   I already provided the change (Spark 3.0.0 if I remember correctly), but this is only "best-effort". You cannot deal with crashing scenario.
   
   Given we directly write the file to the final path, worth noting that reading output files in directory directly doesn't only mean you're reading duplicated outputs (at least once). This also means there's a chance you may be reading incomplete/corrupted files as well, in some sort of crashing during write.
   
   Stepping back to explain the rationalization and the goal - providing holistic solution is not a goal. 
   
   There're already lots of efforts being made to provide holistic solution, though these efforts are happening "outside" of Spark codebase, Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just reinvent the wheel if I try to address the entire problems, which I can't persuade anyone to provide me enough time to work.
   (Please refer the comment which is the feedback I got - https://github.com/apache/spark/pull/27694#issuecomment-651454246 and my reply https://github.com/apache/spark/pull/27694#issuecomment-651479578)
   
   My goal of the overall improvements in file stream source/sink is, enabling end users run the query a bit longer without weird issues (like OOM). I just had to take easier (and limited) approach to solve the issue. e.g. regarding growing entries issue, while alternatives support data compaction to reduce the overall number of files without losing anything, I proposed retention of the output files so that older files can be excluded in metadata log but they no longer be accessible. End users need to pick up alternatives once they cannot live with the limitations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647996061






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652136642






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656584135






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647996040


   **[Test build #124408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124408/testReport)** for PR 28904 at commit [`7721fff`](https://github.com/apache/spark/commit/7721fffb88cfafcbc25a15587d9365068650fca4).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652825177


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652138443






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653957804






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653948039


   **[Test build #124973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124973/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659210212


   **[Test build #125956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125956/testReport)** for PR 28904 at commit [`247a0a1`](https://github.com/apache/spark/commit/247a0a1259f7a87701b8213d33810d1c63ff1e7b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656618115


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125596/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-657000952






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652247425


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648147885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-671108824






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655949341






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656391366






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658984717


   **[Test build #125898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125898/testReport)** for PR 28904 at commit [`006c028`](https://github.com/apache/spark/commit/006c028cf917bcfa4e78955a280e815418cbc2be).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647994188






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652802682


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-671084588


   **[Test build #127241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127241/testReport)** for PR 28904 at commit [`247a0a1`](https://github.com/apache/spark/commit/247a0a1259f7a87701b8213d33810d1c63ff1e7b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647995522


   cc. @tdas @zsxwing @jose-torres @gaborgsomogyi


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r448070885



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,15 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
+  override def shouldRetain(log: SinkFileStatus): Boolean = {
+    log.action != FileStreamSinkLog.DELETE_ACTION

Review comment:
       Just think out loud, DELETE_ACTION is not needed even we consider data compaction, if we do data compaction in compact batch. Nevertheless, let's focus on current state for now. DELETE_ACTION can be dropped.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647995916


   also cc. @xuanyuanking @uncleGen as they have been reviewed sibling PRs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653948129






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-678059772


   Thanks all for reviewing and merging!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468305462



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -132,12 +130,18 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     }
   }
 
+  def serializeEntry(entry: T, out: OutputStream): Unit = {
+    out.write(Serialization.write(entry).getBytes(UTF_8))
+  }
+
+  def deserializeEntry(line: String): T = Serialization.read[T](line)

Review comment:
       private?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647996073


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124408/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648123665






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652175830


   **[Test build #124716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124716/testReport)** for PR 28904 at commit [`849e421`](https://github.com/apache/spark/commit/849e421ae0ef2f2b1b98f1aa6517ec73f56cda15).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656583137


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656801641


   **[Test build #125603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125603/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654531446


   **[Test build #125152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125152/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658598230


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653957870






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652829555


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-671084588


   **[Test build #127241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127241/testReport)** for PR 28904 at commit [`247a0a1`](https://github.com/apache/spark/commit/247a0a1259f7a87701b8213d33810d1c63ff1e7b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659210920






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652245843


   **[Test build #124719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124719/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656391116


   **[Test build #125521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125521/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654091567


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r444051029



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -222,21 +256,22 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
         try {
           val logs =
             getAllValidBatches(latestId, compactInterval).flatMap { id =>
-              super.get(id).getOrElse {
+              filterInBatch(id)(shouldRetain).getOrElse {
                 throw new IllegalStateException(
                   s"${batchIdToPath(id)} doesn't exist " +
                     s"(latestId: $latestId, compactInterval: $compactInterval)")
               }
             }
-          return compactLogs(logs).toArray
+          return logs.toArray
         } catch {
           case e: IOException =>
             // Another process using `CompactibleFileStreamLog` may delete the batch files when
             // `StreamFileIndex` are reading. However, it only happens when a compaction is
             // deleting old files. If so, let's try the next compaction batch and we should find it.
             // Otherwise, this is a real IO issue and we should throw it.
-            latestId = nextCompactionBatchId(latestId, compactInterval)
-            super.get(latestId).getOrElse {
+            val expectedMinLatestId = nextCompactionBatchId(latestId, compactInterval)

Review comment:
       This new approach is to avoid reading the next compact file log, which materializes all entries into the file. It should be extreme case, so it's also OK to keep this as it is if someone strongly think the previous one is better.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654580604


   **[Test build #125152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125152/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652970943


   **[Test build #124877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124877/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654325946






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r452136329



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true

Review comment:
       How about we still keep the `DELETE_ACTION` here?
   ```
   if (log.action == FileStreamSinkLog.DELETE_ACTION) {
     false
   } else {
     true
   }
   ```
   I also noticed that it's not implemented now, but it seems the only way to delete logs when compacting. I think it's also the original design goals. cc @zsxwing who is the author for SPARK-14678 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655089674


   **[Test build #125249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125249/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654644092


   **[Test build #125172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125172/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652138181


   **[Test build #124719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124719/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656408644






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654645258


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125172/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654585071






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656803694






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659143085


   **[Test build #125933 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125933/testReport)** for PR 28904 at commit [`247a0a1`](https://github.com/apache/spark/commit/247a0a1259f7a87701b8213d33810d1c63ff1e7b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r456729588



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       Is there a specific reason to avoid changing the method which is private API? I'd say it'd be less clear to provide the intention on downstream (technically we shouldn't concern with downstream though) - if you don't change the method signature but just change the semantic, no one would know about the change and it will be no-op for downstream.
   
   I'll consider it if you have strong reason to do so, otherwise let's hear other's voice as well. cc. @zsxwing 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675297184


   **[Test build #127530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127530/testReport)** for PR 28904 at commit [`e16ebe4`](https://github.com/apache/spark/commit/e16ebe4e530d3c44bb0ba39981c4ec2287c3589e).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-650889370


   UPDATE: SPARK-30946 + SPARK-30462 with lower down driver memory to 1.5G now writes batch 9039 which RES is around 1.3g. I guess the process uses up available memory if possible, but not leads to OOME.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659208890


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468247295



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.
+   *
+   * NOTE: This doesn't fail early on corruption. The caller should handle the exception

Review comment:
       Shall we mention `IllegalStateException` explicitly because the caller should know what kind of exception it is?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647997923






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658860443






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-667459563


   cc. @tdas @zsxwing @jose-torres kindly reminder 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468249419



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       @HeartSaVioR . Can we have `def shouldRetain(log: T): Boolean = true` to simplify?
   `true` is reasonable for the default value for `Log`, isn't it? And, in this PR, all derived class overrides with `true` only.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654258449


   **[Test build #125053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125053/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654264713


   **[Test build #125090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125090/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654262269


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654259602


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125053/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652831471






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654441306






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652136642






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658519932






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658519488


   **[Test build #125875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125875/testReport)** for PR 28904 at commit [`006c028`](https://github.com/apache/spark/commit/006c028cf917bcfa4e78955a280e815418cbc2be).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655948736


   **[Test build #125445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125445/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659088157






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654093839






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653935583


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124945/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-671084734






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653932363






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r471886440



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true
 }
 
 object FileStreamSinkLog {
   val VERSION = 1
+  // TODO: This action hasn't been used from the introduction. We should just remove this.

Review comment:
       Thanks for the suggestion. Will file an issue and add JIRA number into the code comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-661617632


   cc. @tdas @zsxwing @jose-torres as it's technically not possible to merge only from @xuanyuanking 's review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654531766






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659149320


   Technically it's a private API, even not tagged as developer API - that said, it doesn't break anything in Spark's perspective. If we have confusion with availability of `org.apache.spark.sql.execution` package outside of Spark, then I'd rather say we may need to reconsider adding `private[execution]` on everywhere in the package.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648455941


   Combining efforts together:
   
   * master branch + this patch (SPARK-30462) = batch 919 in 2 hours (120 mins)
   * SPARK-30946 + this patch (SPARK-30462) = batch 919 in 11 mins
   
   10x speedup, reduction of log file size (missed the batch 919 but 1369.compact only took 678M), consistent memory usage (under 1.2G)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653932363






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656624881


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647994188






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654259591


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675298123


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654457204






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647993429


   **[Test build #124408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124408/testReport)** for PR 28904 at commit [`7721fff`](https://github.com/apache/spark/commit/7721fffb88cfafcbc25a15587d9365068650fca4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468242231



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true
 }
 
 object FileStreamSinkLog {
   val VERSION = 1
+  // TODO: This action hasn't been used from the introduction. We should just remove this.

Review comment:
       Could you file a JIRA and make it IDed TODO, @HeartSaVioR ?
   That will improve the visibility on `DELETE_ACTION` related issues.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r448051939



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,67 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFouncException if `throwOnNonExist` is true. If false, the method will just return.
+   */
+  def foreachInBatch(batchId: Long, throwOnNonExist: Boolean)(fn: T => Unit): Unit = {

Review comment:
       I had a caller leveraging this, but while working it seems to be removed. Let's make it simple.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655090318






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652804349


   **[Test build #124870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124870/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648147885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648000807


   **[Test build #124410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124410/testReport)** for PR 28904 at commit [`a2daf01`](https://github.com/apache/spark/commit/a2daf01c5eff591430bad72c4c51f921955df420).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-673221973


   Thanks for the review comments. Please allow some more days (probably next week) to address as I'm on vacation this week and have limited time on looking into laptop. :) Thanks again!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654452838


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659088157






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468242231



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true
 }
 
 object FileStreamSinkLog {
   val VERSION = 1
+  // TODO: This action hasn't been used from the introduction. We should just remove this.

Review comment:
       Could you file a JIRA and make it IDed TODO, @HeartSaVioR ?
   That will improve the visibility on `DELETE_ACTION` related issues. That will lead a follow-up discussion on `DELETE_ACTION` after this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654325956


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125090/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654264713


   **[Test build #125090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125090/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652825177






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653935579






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659204179


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125933/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652804349


   **[Test build #124870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124870/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654440852


   **[Test build #125123 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125123/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656803694


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656927344


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654584036


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652138181


   **[Test build #124719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124719/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655090318






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653959224


   **[Test build #124985 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124985/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654459741


   **[Test build #125127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125127/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653948129






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655253940


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125249/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r455092782



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true

Review comment:
       Make sense. I also did some investigation and searched other thrid-party libraries, seems the DELETE_ACTION is not used. I think it's OK to design new behavior when we need it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652825070


   **[Test build #124870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124870/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675298132


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127530/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655948736


   **[Test build #125445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125445/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656803707


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125603/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652176078


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124716/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655089674


   **[Test build #125249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125249/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654440852


   **[Test build #125123 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125123/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656928129






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654581449


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125152/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659372181






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-671084734






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654325946


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656586083


   **[Test build #125596 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125596/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652804855






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654457204






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666770668


   > for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
   
   I already provided the change (Spark 3.0.0 if I remember correctly), but this is only "best-effort". You cannot deal with crashing scenario.
   
   Given we directly write the file to the final path, worth noting that reading output files in directory directly doesn't only mean you're reading duplicated outputs (at least once). This also means there's a chance you may be reading incomplete/corrupted files as well, in some sort of crashing during write.
   
   Stepping back to explain the rationalization and the goal - providing holistic solution is not a goal. 
   
   There're already lots of efforts being made to provide holistic solution, though these efforts are happening "outside" of Spark codebase, Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just reinvent the wheel if I try to address the entire problems, which I can't persuade anyone to provide me enough time to work.
   (Please refer the comment which is the feedback I got - https://github.com/apache/spark/pull/27694#issuecomment-651454246)
   
   My goal of the overall improvements in file stream souce/sink is, enabling end users run the query a bit longer without weird issues (like OOM). I just had to take easier (and limited) approach to solve the issue. e.g. regarding growing entries issue, while alternatives support data compaction to reduce the overall number of files, I proposed retention of the output files so that older files can be excluded in metadata log.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653931916


   **[Test build #124945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124945/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653957873


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124973/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654259591






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652136149


   **[Test build #124716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124716/testReport)** for PR 28904 at commit [`849e421`](https://github.com/apache/spark/commit/849e421ae0ef2f2b1b98f1aa6517ec73f56cda15).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654054397


   **[Test build #124985 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124985/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654585071






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656471539


   retest this, please
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666770668


   > for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
   
   I already provided the change (Spark 3.0.0 if I remember correctly), but this is only "best-effort". You cannot deal with crashing scenario.
   
   Given we directly write the file to the final path, worth noting that reading output files in directory directly doesn't only mean you're reading duplicated outputs (at least once). This also means there's a chance you may be reading incomplete/corrupted files as well, in some sort of crashing during write.
   
   Stepping back to explain the rationalization and the goal - providing holistic solution is not a goal. 
   
   There're already lots of efforts being made, though these efforts are happening "outside" of Spark codebase, Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just reinvent the wheel if I try to address the entire problems, which I can't persuade anyone to provide me enough time to work.
   (Please refer the comment which is the feedback I got - https://github.com/apache/spark/pull/27694#issuecomment-651454246)
   
   My goal of the overall improvements in file stream souce/sink is, enabling end users run the query a bit longer without weird issues (like OOM). I just had to take easier (and limited) approach to solve the issue. e.g. regarding growing entries issue, while alternatives support data compaction to reduce the overall number of files, I proposed retention of the output files so that older files can be excluded in metadata log.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655949341






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654459741


   **[Test build #125127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125127/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658984717


   **[Test build #125898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125898/testReport)** for PR 28904 at commit [`006c028`](https://github.com/apache/spark/commit/006c028cf917bcfa4e78955a280e815418cbc2be).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647997923






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468247295



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.
+   *
+   * NOTE: This doesn't fail early on corruption. The caller should handle the exception

Review comment:
       Shall we mention `IllegalStateException` explicitly because the caller should know what kind of exception it is? Or, you may want to mention `IllegalStateException` at line 182 in a similar way to `FileNotFoundException`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r447677532



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,67 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFouncException if `throwOnNonExist` is true. If false, the method will just return.

Review comment:
       typo: FileNotFoundException

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
##########
@@ -165,6 +165,63 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
     }
   }
 
+  /**
+   * Apply provided function to each entry in the specific batch metadata log.
+   *
+   * Unlike get which will materialize all entries into memory, this method streamlines the process
+   * via READ-AND-PROCESS. This helps to avoid the memory issue on huge metadata log file.
+   *
+   * NOTE: This no longer fails early on corruption. The caller should handle the exception
+   * properly and make sure the logic is not affected by failing in the middle.
+   */
+  def applyFnToBatchByStream[RET](batchId: Long)(fn: InputStream => RET): RET = {

Review comment:
       nit: duplicated code with `get`.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,67 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFouncException if `throwOnNonExist` is true. If false, the method will just return.
+   */
+  def foreachInBatch(batchId: Long, throwOnNonExist: Boolean)(fn: T => Unit): Unit = {

Review comment:
       What kind of scenario do we need `throwOnNonExist = false`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
##########
@@ -165,6 +165,63 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
     }
   }
 
+  /**
+   * Apply provided function to each entry in the specific batch metadata log.
+   *
+   * Unlike get which will materialize all entries into memory, this method streamlines the process
+   * via READ-AND-PROCESS. This helps to avoid the memory issue on huge metadata log file.
+   *
+   * NOTE: This no longer fails early on corruption. The caller should handle the exception
+   * properly and make sure the logic is not affected by failing in the middle.
+   */
+  def applyFnToBatchByStream[RET](batchId: Long)(fn: InputStream => RET): RET = {
+    val batchMetadataFile = batchIdToPath(batchId)
+    if (fileManager.exists(batchMetadataFile)) {
+      val input = fileManager.open(batchMetadataFile)
+      try {
+        fn(input)
+      } catch {
+        case ise: IllegalStateException =>
+          // re-throw the exception with the log file path added
+          throw new IllegalStateException(
+            s"Failed to read log file $batchMetadataFile. ${ise.getMessage}", ise)
+      } finally {
+        IOUtils.closeQuietly(input)
+      }
+    } else {
+      throw new FileNotFoundException(s"Unable to find batch $batchMetadataFile")
+    }
+  }
+
+  /**
+   * Store the metadata for the specified batchId and return `true` if successful. This method
+   * fills the content of metadata via executing function. If the function throws exception,
+   * writing will be automatically cancelled and this method will propagate the exception.
+   *
+   * If the batchId's metadata has already been stored, this method will return `false`.
+   */
+  def addNewBatchByStream(batchId: Long)(fn: OutputStream => Unit): Boolean = {

Review comment:
       ditto, duplicate code with writeBatchToFile

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       Little confused about this change. Seems the refactor of compactLogs and `DELETE_ACTION` didn't relate to the memory issue fixing? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655253929


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656158267


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125445/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r471878410



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       We'll enable the condition on file stream source and sink via #28363 and #28422 so eventually majority of derived classes will override the method. But I also agree `true` is the reasonable default value. I'll update this. Thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r444050433



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -222,21 +256,22 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
         try {
           val logs =
             getAllValidBatches(latestId, compactInterval).flatMap { id =>
-              super.get(id).getOrElse {
+              filterInBatch(id)(shouldRetain).getOrElse {

Review comment:
       This would help when we introduce a new condition on exclusion of entries.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,15 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
+  override def shouldRetain(log: SinkFileStatus): Boolean = {
+    log.action != FileStreamSinkLog.DELETE_ACTION

Review comment:
       While I just keep this, I think we should just remove this. As I left TODO below, it hasn't been used, exists hypothetically.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -212,7 +246,7 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   /**
    * Returns all files except the deleted ones.
    */
-  def allFiles(): Array[T] = {
+  def allFiles(predicate: T => Boolean = _ => true): Array[T] = {

Review comment:
       We can also have a streamlined version of this method to avoid materializing all entries on initializing FileStreamSource, though I think there's the another problem we should solve (file stream source log should not have bunch of entries - think about other data sources) and once we fixed that issue it won't matter at all.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       I just retained the functionality of exclusion, as we still have a chance to apply retention which is applied per entry.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -222,21 +256,22 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
         try {
           val logs =
             getAllValidBatches(latestId, compactInterval).flatMap { id =>
-              super.get(id).getOrElse {
+              filterInBatch(id)(shouldRetain).getOrElse {
                 throw new IllegalStateException(
                   s"${batchIdToPath(id)} doesn't exist " +
                     s"(latestId: $latestId, compactInterval: $compactInterval)")
               }
             }
-          return compactLogs(logs).toArray
+          return logs.toArray
         } catch {
           case e: IOException =>
             // Another process using `CompactibleFileStreamLog` may delete the batch files when
             // `StreamFileIndex` are reading. However, it only happens when a compaction is
             // deleting old files. If so, let's try the next compaction batch and we should find it.
             // Otherwise, this is a real IO issue and we should throw it.
-            latestId = nextCompactionBatchId(latestId, compactInterval)
-            super.get(latestId).getOrElse {
+            val expectedMinLatestId = nextCompactionBatchId(latestId, compactInterval)

Review comment:
       This new approach is to avoid reading the next compact file log, which materializes all entries into the file.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,15 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
+  override def shouldRetain(log: SinkFileStatus): Boolean = {
+    log.action != FileStreamSinkLog.DELETE_ACTION
   }
 }
 
 object FileStreamSinkLog {
   val VERSION = 1
+  // TODO: This action hasn't been used from the introduction. We should just remove this.
+  // TODO: We can remove the field "action" as well, ignoring "action" in existing metadata log.

Review comment:
       Note that this would also help to reduce the size of each entry. OK to leave only ADD_ACTION if JSON serializer/deserializer complains.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r456423370



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       I take some time looking into the detail. Seems like we can use `Stream` here, which can delay the evaluation for each file status. Then we can both keep the API and have the improvement together. But for using `Stream[Int]` here, we'll have a restriction for the implementation of `compactLogs`, which also needs to return `Stream[Int]`, otherwise, the improvement of streamline will not take effect.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653948039


   **[Test build #124973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124973/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-650684827


   Very impressive! I'll review this in 2 days.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647993429


   **[Test build #124408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124408/testReport)** for PR 28904 at commit [`7721fff`](https://github.com/apache/spark/commit/7721fffb88cfafcbc25a15587d9365068650fca4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654056598


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124985/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656618111






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652247430


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124719/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654531766






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658586787


   **[Test build #125875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125875/testReport)** for PR 28904 at commit [`006c028`](https://github.com/apache/spark/commit/006c028cf917bcfa4e78955a280e815418cbc2be).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r452217417



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true

Review comment:
       Please note that after the patch we are no longer able to match ADD_ACTION and DELETE_ACTION in compaction, so that's the responsibility of "reader" to exclude entities which have both ADD_ACTION and DELETE_ACTION, even we take DELETE_ACTION into account.
   
   Returning `false` for DELETE_ACTION here simply drops the entities for DELETE_ACTION without dropping ADD_ACTION, hence some files would be resurrected on the reader side after compaction, making problems.
   
   DELETE_ACTION is introduced years ago and we still haven't touched it. I think we should just defer and shouldn't deal with the thing we don't have and also don't have a clear plan to address. Once we design the new behavior, we would know what we should do. Dealing with something according to the imagination blocks us to move forward.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656154418


   **[Test build #125445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125445/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666770668


   > for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
   
   I already provided the change (Spark 3.0.0 if I remember correctly), but this is only "best-effort". You cannot deal with crashing scenario.
   
   Given we directly write the file to the final path, worth noting that reading output files in directory directly doesn't only mean you're reading duplicated outputs (at least once). This also means there's a chance you may be reading incomplete/corrupted files as well, in some sort of crashing during write.
   
   Stepping back to explain the rationalization and the goal - providing holistic solution is not a goal. 
   
   There're already lots of efforts being made to provide holistic solution, though these efforts are happening "outside" of Spark codebase, Delta Lake, Apache Iceberg, Apache Hudi, and probably more. I'd just reinvent the wheel if I try to address the entire problems, which I can't persuade anyone to provide me enough time to work.
   (Please refer the comment which is the feedback I got - https://github.com/apache/spark/pull/27694#issuecomment-651454246)
   
   My goal of the overall improvements in file stream souce/sink is, enabling end users run the query a bit longer without weird issues (like OOM). I just had to take easier (and limited) approach to solve the issue. e.g. regarding growing entries issue, while alternatives support data compaction to reduce the overall number of files without losing anything, I proposed retention of the output files so that older files can be excluded in metadata log but they no longer be accessible.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659372181






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659203149


   **[Test build #125933 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125933/testReport)** for PR 28904 at commit [`247a0a1`](https://github.com/apache/spark/commit/247a0a1259f7a87701b8213d33810d1c63ff1e7b).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647986027


   **[Test build #124403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124403/testReport)** for PR 28904 at commit [`b00e262`](https://github.com/apache/spark/commit/b00e2621b448fc810447b7c2ed9c99d8e60c96e7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675250393


   **[Test build #127530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127530/testReport)** for PR 28904 at commit [`e16ebe4`](https://github.com/apache/spark/commit/e16ebe4e530d3c44bb0ba39981c4ec2287c3589e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-647986664






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654584787


   **[Test build #125172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125172/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r449934740



##########
File path: sql/core/src/test/resources/structured-streaming/file-sink-log-version-2.1.0/8
##########
@@ -1,3 +1,2 @@
 v1
 {"path":"/a/b/8","size":800,"isDir":false,"modificationTime":800,"blockReplication":1,"blockSize":100,"action":"add"}
-{"path":"/a/b/0","size":100,"isDir":false,"modificationTime":100,"blockReplication":1,"blockSize":100,"action":"delete"}

Review comment:
       This should be removed - this gave false information as we support DELETE_ACTION previously, but in reality it was never implemented.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652972470






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656586083


   **[Test build #125596 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125596/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659141316


   `Well, I guess I already explained why compactLogs is the culprit of the memory issue, right? (#28904 (comment))`
   
   Yep that's right. I'm also looking at the code in detail and try to find a way both keep this API and have the improvement. If it's hard to achieve, of course the improvement has a higher priority. I'll take a closer look today.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656626272






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675250393


   **[Test build #127530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127530/testReport)** for PR 28904 at commit [`e16ebe4`](https://github.com/apache/spark/commit/e16ebe4e530d3c44bb0ba39981c4ec2287c3589e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654452846


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125123/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652176070


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656584135






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itsvikramagr commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
itsvikramagr commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666248758


   @HeartSaVioR - This is a much-needed fix. Thanks for it.
   
   I have an orthogonal question. Why do we need to worry about compacting the file sink metadata? I can think of following reasons 
   -  the downstream read operations can read the compacted metadata file to list all committed files. So they can avoid the listing cost and also improve performance
   - Helps in exactly-once semantics. On task failures, we don't have to worry about deleting any files written. 
   
   If the compacted metadata file size is running into GBs, the number of valid files would be in millions. In practice, the end-user will consider this sink path as a staging location and have another job to compact these small files into a final destination. 
   
   for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
    
   In short, if we provide an option just to have last few commits in sink metadata to ensure SS is not impacted. And make changes in various readers not to read using metadata log files. Won't it help in ensuring the reliability of the streaming job?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654093148


   **[Test build #125053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125053/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654581445


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654468511






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653935579


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654265205






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653935559


   **[Test build #124945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124945/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654529866


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468306894



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
##########
@@ -160,8 +152,42 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
         IOUtils.closeQuietly(input)
       }
     } else {
-      logDebug(s"Unable to find batch $batchMetadataFile")
-      None
+      throw new FileNotFoundException(s"Unable to find batch $batchMetadataFile")
+    }
+  }
+
+  /**
+   * Store the metadata for the specified batchId and return `true` if successful. This method
+   * fills the content of metadata via executing function. If the function throws exception,

Review comment:
       Nit: exception -> an exception




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675248798






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655088167


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654584787


   **[Test build #125172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125172/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655947410


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648146041


   **[Test build #124410 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124410/testReport)** for PR 28904 at commit [`a2daf01`](https://github.com/apache/spark/commit/a2daf01c5eff591430bad72c4c51f921955df420).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656158252






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648000807


   **[Test build #124410 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124410/testReport)** for PR 28904 at commit [`a2daf01`](https://github.com/apache/spark/commit/a2daf01c5eff591430bad72c4c51f921955df420).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654056588






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656928129






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654093148


   **[Test build #125053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125053/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r448050453



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       The UT I had to remove leverages this behavior - it's only possible when we collect all entries, as it assumes DELETE_ACTION can come after ADD_ACTION "in any further batches" and compactLogs should be able to filter out. This also requires us to materialize all entries.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654581445






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468249419



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       @HeartSaVioR . Can we have `def shouldRetain(log: T): Boolean = true` to simplify?
   `true` is reasonable for the default value for `Log`, isn't it? And, in this PR, all derived class overrides with `true` only. The special derived classes will `override` in any way.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468249419



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       @HeartSaVioR . Can we have `def shouldRetain(log: T): Boolean = true` to simplify?
   `true` is reasonable for the default value for `Log`, isn't it? And, in this PR, all derived class overrides with `true` only. I believe that the special derived classes will `override` in any way according to their semantics.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656618111


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653254475


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652247425






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468246803



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.

Review comment:
       Shall we mention `IllegalStateException` together explicitly?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653957804






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654645251


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r448049735



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       `compactLogs` requires to pass "all entries" which should be loaded into memory, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-657000952






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-653957750


   **[Test build #124973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124973/testReport)** for PR 28904 at commit [`11cb693`](https://github.com/apache/spark/commit/11cb6930104ab9f9aef3a471650504f2c18eb777).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468305223



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.
+   *
+   * NOTE: This doesn't fail early on corruption. The caller should handle the exception
+   * properly and make sure the logic is not affected by failing in the middle.

Review comment:
       How to ensure all the callers handle it properly? Do we need to introduce a test to cover it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659143085


   **[Test build #125933 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125933/testReport)** for PR 28904 at commit [`247a0a1`](https://github.com/apache/spark/commit/247a0a1259f7a87701b8213d33810d1c63ff1e7b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-659091597






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656625881


   **[Test build #125603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125603/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r471910255



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,15 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
+  override def shouldRetain(log: SinkFileStatus): Boolean = {
+    log.action != FileStreamSinkLog.DELETE_ACTION
   }
 }
 
 object FileStreamSinkLog {
   val VERSION = 1
+  // TODO: This action hasn't been used from the introduction. We should just remove this.
+  // TODO: We can remove the field "action" as well, ignoring "action" in existing metadata log.

Review comment:
       Let's consider only when we are open to introduce another version. Don't want to deal with compatibility even without bumping version.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654531446


   **[Test build #125152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125152/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itsvikramagr edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
itsvikramagr edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666248758


   @HeartSaVioR - This is a much-needed fix. Thanks for it.
   
   I have an orthogonal question. Why do we need to worry about file sink metadata files? I can think of the following reasons 
   -  the downstream read operations can read the compacted metadata file to list all committed files. So they can avoid the listing cost and also improve performance
   - Helps in exactly-once semantics. On task failures, we don't have to worry about deleting any files written. 
   
   If the compacted metadata file size is running into GBs, the number of valid files would be in millions. In practice, the end-user will consider this sink path as a staging location and have another job to compact these small files into a final destination. 
   
   for exactly-once semantics, we can add make changes in ManifestFileCommitter to delete files in the abort function. Or we can come up with some other alternatives.
    
   In short, if we provide an option just to have last few commits in sink metadata to ensure SS is not impacted. And make changes in various readers not to read using metadata log files. Won't it help in ensuring the reliability of the streaming job?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652138443






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r471883048



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.
+   *
+   * NOTE: This doesn't fail early on corruption. The caller should handle the exception

Review comment:
       `IllegalStateException` is not the only exception this method can throw. It can also throw `IOException` as it deals with file, and also depends on `deserializeEntry`.
   
   I think the type of exception won't affect what caller should do. Caller may want to deal with FNFE specifically, but still do the same for others, and even for FNFE it's safer to follow the guide on NOTE.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656625881


   **[Test build #125603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125603/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r471885284



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -173,37 +177,64 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
   override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
     s"Cannot purge as it might break internal state.")
 
+  /**
+   * Apply function on all entries in the specific batch. The method will throw
+   * FileNotFoundException if the metadata log file doesn't exist.
+   *
+   * NOTE: This doesn't fail early on corruption. The caller should handle the exception
+   * properly and make sure the logic is not affected by failing in the middle.

Review comment:
       I'm not sure how this class ensures callers are following the guide. Did you mean we'd like to test this behavior with derived classes (file stream source/sink) log? Or we'd like to test this behavior with test-purpose implementation of CompactibleFileStreamLog?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652831471






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656408649


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125521/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xuanyuanking commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
xuanyuanking commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r457778586



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala
##########
@@ -106,10 +106,8 @@ abstract class CompactibleFileStreamLog[T <: AnyRef : ClassTag](
     interval
   }
 
-  /**
-   * Filter out the obsolete logs.
-   */
-  def compactLogs(logs: Seq[T]): Seq[T]
+  /** Determine whether the log should be retained or not. */
+  def shouldRetain(log: T): Boolean

Review comment:
       Maybe not strong enough. IMO if we are able to avoid breaking the third-party libs, we'd better do so. Yep, let's hear more voice.
   Great thanks for your patience and reply. It's the last concern left, other changes for this PR LGTM. I do think we need this improvement!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-655253929






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on a change in pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #28904:
URL: https://github.com/apache/spark/pull/28904#discussion_r468306410



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##########
@@ -97,18 +97,13 @@ class FileStreamSinkLog(
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
 
-  override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
-    val deletedFiles = logs.filter(_.action == FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
-    if (deletedFiles.isEmpty) {
-      logs
-    } else {
-      logs.filter(f => !deletedFiles.contains(f.path))
-    }
-  }
+  override def shouldRetain(log: SinkFileStatus): Boolean = true
 }
 
 object FileStreamSinkLog {
   val VERSION = 1
+  // TODO: This action hasn't been used from the introduction. We should just remove this.

Review comment:
       Add a JIRA number in the comment?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656408644


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-658519488


   **[Test build #125875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125875/testReport)** for PR 28904 at commit [`006c028`](https://github.com/apache/spark/commit/006c028cf917bcfa4e78955a280e815418cbc2be).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675298123






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-650460286


   I've been running sustain tests (still running) and here's some observation:
   
   - SPARK-30946 + SPARK-30462 just wrote compact batch 7309 (73,100,000 entries) which size is 3.5G, and RES has been less than 2G (It increases a bit. Maybe also better to run with smaller heap memory.) It took 169517 ms (2.8 mins) to build a compact batch 7309, which is similar to build a compact batch 479 (4,800,000) without SPARK-30946. (15x processed)
   - SPARK-30462 itself just wrote compact batch 1739 (17,390,000 entries) which size is 3.1G, and RES has been less than 2G.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-656617845


   **[Test build #125596 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125596/testReport)** for PR 28904 at commit [`8034ca4`](https://github.com/apache/spark/commit/8034ca4bb8401f2070b8bb8048722255d145fdec).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654468511


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-652182305


   org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)
   
   This seems to be failing frequently. I'll see other build result being run via 124719.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-675248798






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-654439146


   retest this, please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28904: [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-648123665






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org