Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/12 16:39:09 UTC

[GitHub] [spark] guanziyue opened a new pull request, #37498: [SPARK-40058][CORE] Avoid filter file path twice in HadoopFSUtils

guanziyue opened a new pull request, #37498:
URL: https://github.com/apache/spark/pull/37498

   ### What changes were proposed in this pull request?
   Refactor the path filter logic in HadoopFSUtils so that the same filter is not applied to a file more than once. Because the method listLeafFiles is called recursively, a file can pass through the filter at more than one level of the recursion. In particular, the filter runs in a single thread on the driver side over all files, so a heavy filter leads to a performance issue.
   
   
   ### Why are the changes needed?
   Apply the filter to each FileStatus only once, at the point where it is first encountered.
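
The difference can be illustrated with a minimal sketch. This is not the Spark code: `Node`, `list_leaf_files_refiltering`, `list_leaf_files_filter_once`, and `counting_filter` are hypothetical names, and the in-memory tree merely stands in for a Hadoop file listing. It assumes the filter is applied to leaf files only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a file status entry (not a Hadoop class)."""
    path: str
    is_dir: bool = False
    children: List["Node"] = field(default_factory=list)


def list_leaf_files_refiltering(root: Node, accept: Callable[[str], bool]) -> List[Node]:
    """Filters the combined result at every recursion level, so a file
    nested d levels deep is passed to the filter up to d times."""
    leaves: List[Node] = []
    for child in root.children:
        if child.is_dir:
            leaves.extend(list_leaf_files_refiltering(child, accept))
        else:
            leaves.append(child)
    return [f for f in leaves if accept(f.path)]


def list_leaf_files_filter_once(root: Node, accept: Callable[[str], bool]) -> List[Node]:
    """Filters each file only in the directory where it is first seen;
    results returned by recursive calls are trusted as already filtered."""
    leaves: List[Node] = []
    for child in root.children:
        if child.is_dir:
            leaves.extend(list_leaf_files_filter_once(child, accept))
        elif accept(child.path):
            leaves.append(child)
    return leaves


def counting_filter() -> Tuple[Callable[[str], bool], List[int]]:
    """Returns a filter that rejects '.tmp' paths, plus a one-element call counter."""
    calls = [0]

    def accept(path: str) -> bool:
        calls[0] += 1
        return not path.endswith(".tmp")

    return accept, calls


# A small three-level tree used for the comparison.
tree = Node("root", is_dir=True, children=[
    Node("root/a.txt"),
    Node("root/d1", is_dir=True, children=[
        Node("root/d1/b.txt"),
        Node("root/d1/d2", is_dir=True, children=[
            Node("root/d1/d2/c.txt"),
            Node("root/d1/d2/skip.tmp"),
        ]),
    ]),
])

if __name__ == "__main__":
    for lister in (list_leaf_files_refiltering, list_leaf_files_filter_once):
        accept, calls = counting_filter()
        result = [n.path for n in lister(tree, accept)]
        print(lister.__name__, result, "filter calls:", calls[0])
```

Both functions return the same files, but the second calls the filter once per file regardless of nesting depth, which matters when the filter itself is expensive.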
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   No test was added because the change is simple enough.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #37498:
URL: https://github.com/apache/spark/pull/37498#issuecomment-1215044522

   Merged to master


[GitHub] [spark] WangGuangxin commented on pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on PR #37498:
URL: https://github.com/apache/spark/pull/37498#issuecomment-1214524050

   > nit.
   > 
   > * @WangGuangxin 's PR seems to have been opened 2 days before this PR and is smaller than this PR. There is no difference in the logic.
   > * Also, [SPARK-40035](https://issues.apache.org/jira/browse/SPARK-40035) was reported before [SPARK-40058](https://issues.apache.org/jira/browse/SPARK-40058) (this PR's JIRA).
   > 
   > In this case, technically, we usually keep the smallest JIRA ID (and its earlier code contribution) in the community. And [SPARK-40058](https://issues.apache.org/jira/browse/SPARK-40058) and this PR are supposed to be closed as `Duplicated`.
   > 
   > Just a question. @WangGuangxin and @guanziyue , is there any coordination between you in order to keep this one?
   
   Yes, we have communicated offline. Please go ahead. Thanks


[GitHub] [spark] srowen closed pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils
URL: https://github.com/apache/spark/pull/37498


[GitHub] [spark] AmplabJenkins commented on pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37498:
URL: https://github.com/apache/spark/pull/37498#issuecomment-1213660182

   Can one of the admins verify this patch?


[GitHub] [spark] dongjoon-hyun commented on pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #37498:
URL: https://github.com/apache/spark/pull/37498#issuecomment-1214485389

   nit.
   - @WangGuangxin 's PR seems to have been opened 2 days before this PR and is smaller than this PR. There is no difference in the logic.
   - Also, SPARK-40035 was reported before SPARK-40058.
   
   In this case, technically, we usually keep the smallest JIRA ID (and its earlier code contribution) in the community. And SPARK-40058 and this PR are supposed to be closed as `Duplicated`.
   
   Just a question. @WangGuangxin and @guanziyue , is there any coordination between you in order to keep this one?


[GitHub] [spark] guanziyue commented on pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils

Posted by GitBox <gi...@apache.org>.
guanziyue commented on PR #37498:
URL: https://github.com/apache/spark/pull/37498#issuecomment-1214293983

   > (Drive-by comment / note to other reviewers): it looks like this PR and #37467 are aiming to solve the same problem (and thus are duplicates). We should pick a preferred approach and close one of the PRs / JIRAs as a duplicate.
   
   Thanks for the reminder, and thanks to WangGuangxin. Could we continue with this PR?


[GitHub] [spark] JoshRosen commented on pull request #37498: [SPARK-40058][CORE] Avoid filter file path more than once in HadoopFSUtils

Posted by GitBox <gi...@apache.org>.
JoshRosen commented on PR #37498:
URL: https://github.com/apache/spark/pull/37498#issuecomment-1213430361

   (Drive-by comment / note to other reviewers): it looks like this PR and https://github.com/apache/spark/pull/37467 are aiming to solve the same problem (and thus are duplicates). We should pick a preferred approach and close one of the PRs / JIRAs as a duplicate.

