You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/12 04:04:43 UTC

[GitHub] [spark] LuciferYang opened a new pull request, #38214: [SPARK-40765][SQL] Optimize redundant fs operations in `CommandUtils#calculateSingleLocationSize#getPathSize` method

LuciferYang opened a new pull request, #38214:
URL: https://github.com/apache/spark/pull/38214

   ### What changes were proposed in this pull request?
   This pr change the 2nd input parameter from `Path` to `FileStatus to avoid redundant `fs.getFileStatus(path)` in each recursive call.
   
   
   ### Why are the changes needed?
   Reduce one dfs operation.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Pass Github Actions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276963870

   Ignore the previous
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276312561

   I see so it saves one call, not one call per recursion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276300373

   <img width="648" alt="image" src="https://user-images.githubusercontent.com/1475305/195373375-9b721841-fc21-45f1-b88a-85b499c01237.png">
   
   https://github.com/apache/spark/blob/0dd78341e64a7e146413156834448707f16bba83/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala#L116-L132
   
   Line 122 It is a recursive call. The original logic is pass in `status.getPath`, and then execute `val fileStatus = fs.getFileStatus(status.getPath)` (Line 117).
   
   After change, the recursive call reuse the incoming status.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276311354

   > I don't see the same status retrieved twice in the original
   
   Before recursion, there is an exist `status` instance,  then the recursive call will call `getStatus` for the exist `status`.`getPath`,  Isn't this an unnecessary call?
   
   If there is no recursion, there is no saving.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1275973420

   Yes, one `fs.getFileStatus(path)` operation is saved for each recursive call


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276950904

   Thanks @srowen @HyukjinKwon and @zhengruifeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276354759

   For example , the data path is `/a/b/c/d.txt`:
   
   Before
   
   - recursion-0: pass in `Path(/a)` -> ` fs.getFileStatus(Path(/a))` -> `fs.listStatus(Status(/a))` -> `Status(/a/b)` -> call `getPathSize(Status(/a/b).getPath)`
   
   - recursion-1: pass in `Path(/a/b)`  ->  `fs.getFileStatus(Path(/a/b))` ->  `fs.listStatus(Status(/a/b))` -> `Status(/a/b/c)` -> call `getPathSize(Status(/a/b/c).getPath)`
   
   - recursion-2: pass in `Path(/a/b/c)` ->  `fs.getFileStatus(Path(/a/b/c))` -> `fs.listStatus(Status(/a/b/c))` -> `Status(/a/b/c/d.txt)` -> call `getPathSize(Status(/a/b/c/d.txt).getPath)`
   
   - recursion-3: pass in path of `Path(/a/b/c/d.txt)` -> `fs.getFileStatus(Path(/a/b/c/d.txt))` -> `Status(/a/b/c/d.txt)` -> return `size`
   
   After
   
   fs.getFileStatus(Path(/a))
   
   - recursion-0: pass in `Status(/a)` -> `fs.listStatus(Status(/a))` -> `Status(/a/b)` -> call `getPathSize(Status(/a/b))`
   
   - recursion-1: pass in `Status(/a/b)` -> `fs.listStatus(Status(/a/b))` -> `Status(/a/b/c)` -> call `getPathSize(Status(/a/b/c))`
   
   - recursion-2: pass in `Status(/a/b/c)` -> `fs.listStatus(Status(/a/b/c))` -> > `Status(/a/b/c/d.txt)` -> call `getPathSize(Status(/a/b/c/d.txt))`
   
   - recursion-3: pass in `Status(/a/b/c/d.txt)` -> return `size`
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276474003

   Oh yes I think that's it. The other call is not in the same recursive loop. Ok I get it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276472661

   <img width="615" alt="image" src="https://user-images.githubusercontent.com/1475305/195402745-99aa2d9f-1e68-4ba9-a574-02e40b5190d9.png">
   
   
   Is it because of code folding? After change, only getFileStatus will be made for `/a`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276963399

   I need to know more about the rules for defining error-class, so close first
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276389487

   From the example:
   
   - `fs.getFileStatus(Path(/a))` not saved, but no additional calls are added
   - `fs.getFileStatus(Path(/a/b))`,`fs.getFileStatus(Path(/a/b/c))` and `fs.getFileStatus(Path(/a/b/c/d.txt))` saved
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1275918414

   so it can save some IO operations?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
URL: https://github.com/apache/spark/pull/38214


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276475810

   > Oh yes I think that's it. The other call is not in the same recursive loop. Ok I get it
   
   Yes, and I finally know why you had doubts before :)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276359784

   @srowen Is the above example helpful?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276317638

   The saved times are related to the directory level and the number of directories in each level


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276320786

   I may have to think about it more, but still don't see how this saves much as you have _added_ a new call to getFileStatus too. Right? how can it save all the calls but one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276468566

   OK, I'm still not seeing how they are saved, because in the recursive call, getFileStatus is called. After, you are still calling this on /a/b, /a/b/c, etc. Right? That is line 138


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276938912

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1275557093

   cc @HyukjinKwon @zhengruifeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276288007

   I don't quite see how this saves a call; the caller makes an fs call instead of the callee, but the number of calls seems like it must be the same?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276303206

   But the caller had to add a new call to get the status. It seems to just be moved around. I don't see the same status retrieved twice in the original, so wondering how we can actually avoid a call here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276315937

   Every time a recursive call occurs, it will be saved once (the initial call is not counted, and the initial call is not saved)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org